Scott Edmunds: A new publishing workflow for rapid dissemination of genomes u...GigaScience, BGI Hong Kong
Scott Edmunds on a new publishing workflow for rapid dissemination of genomes using GigaByte & GigaDB. Presented at Biodiversity 2020 in the Annotation & Databases track, 9th October 2020.
The original abstract for the talk is below, but the talk changed in response to strong interest in InChI and the possibilities of using it in a Semantic Web for Chemistry.
The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. However, freedom has its costs, and in many cases the cost is quality. ChemSpider is a free-access website for chemists built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 150 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of how a curated platform can become the centralized hub for resourcing information about chemical entities. We will also present ChemMantis, an entity-extraction platform for extracting chemical names and scientific terms from documents and providing a platform for structure-based searching of Open Access chemistry literature.
Jesse Xiao at CODATA2017: Updates to the GigaDB open access data publishing p...GigaScience, BGI Hong Kong
Jesse Xiao at the Data Publishing session at CODATA2017: Updates to the GigaDB open access data publishing platform. Wednesday 11th October in St Petersburg, Russia
A description of BRISSKit, an open source tool that may be used to combine datasets held in different locations and analyse them for the purpose of research. Talk given by Jonathan Tedds of the University of Leicester for the Data Management in Practice workshop, which took place on 14th November 2013 at the London School of Hygiene and Tropical Medicine.
There is an increasing availability of free and open access resources for scientists to use on the internet. Coupled with the increasing availability of Open Source software tools, we are in the middle of a revolution in data availability and in the tools to manipulate these data. ChemSpider is a free-access website for chemists built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 200 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of the ChemSpider platform and how it is fast becoming the centralized hub for resourcing information about chemical entities.
Open Science is a movement to make scientific research, its data and dissemination accessible to all levels of society. This movement considers aspects such as Open Access, Open Data, Reproducible Research and Open Software.
Each of these aspects presents specific issues that need to be evaluated and discussed by the scientific community, so that guidelines can be established that facilitate the dissemination of scientific information.
The great challenge is to establish effective and efficient practices that allow journals to incorporate these demands into their editorial processes, so as not only to make data, software and methods accessible, but also to encourage the community to do so.
With these questions in mind, this panel proposes to discuss important aspects of the advancement of research communication. Some of these aspects are reflected in the SciELO indexing criteria, as is the case of referencing research materials in favor of transparency and reproducibility.
Syllabus
FAIR criteria, concepts and implementation; challenges for the publication of data and methods; institutional policies for open data; adoption of TOP guidelines (Transparency and Openness Promotion); software repositories; thematic areas data repositories.
ChemSpider is a free-access website for chemists built with the intention of providing a structure-centric community for chemists. It was developed to index available sources of chemical structures and their associated data into a single searchable repository and to make it available to everybody, at no charge. While there are a large number of databases containing chemical compounds and data available online, their inherent quality, accuracy and completeness are severely lacking. ChemSpider has provided a platform so that the chemistry community can contribute to improving the quality of data online and to expanding the information to include data such as reaction syntheses, analytical data, experimental properties and linkages to other valuable resources. It has grown into a resource containing over 21 million unique chemical structures from over 200 data sources.
This presentation will provide an overview of ChemSpider and its value to chemists as a search tool, as a public repository of information and how it can become one of the primary foundations of internet-based chemistry. I will also discuss the vision for ChemSpider and some of the lofty goals we are setting for the system moving forward.
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021dkNET
Abstract
Good data stewardship is the cornerstone of knowledge, discovery, and innovation in research. The FAIR Data Principles address data creators, stewards, software engineers, publishers, and others to promote maximum use of research data. The principles can be used as a framework for fostering and extending research data services.
This talk will provide an overview of the FAIR principles and the drivers behind their development by a broad community of international stakeholders. We will explore a range of topics related to putting FAIR data into practice, including how and where data can be described, stored, and made discoverable (e.g., data repositories, metadata); methods for identifying and citing data; interoperability of (meta)data; best-practice examples; and tips for enabling data reuse (e.g., data licensing). Practical examples of how FAIR is applied will be provided along the way.
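As a concrete illustration of the "describe and make discoverable" step mentioned above, a minimal machine-readable dataset description can be written as a schema.org Dataset record. This is only a sketch of one common approach, not a mandated format, and every field value below is an invented placeholder:

```python
import json

# A minimal schema.org Dataset record: one common way to make a dataset
# findable by search engines and metadata harvesters. All values are
# invented placeholders, not a real dataset.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example sequencing dataset",
    "description": "Illustrative metadata record for a shared dataset.",
    "identifier": "https://doi.org/10.5555/example",  # persistent identifier (DOI)
    "license": "https://creativecommons.org/licenses/by/4.0/",  # enables reuse
    "creator": [{"@type": "Person", "name": "A. Researcher"}],
    "datePublished": "2021-01-22",
}

# Serialize as JSON-LD, ready to embed in a landing page or deposit record.
record = json.dumps(dataset, indent=2)
print(record)
```

Including a persistent identifier and an explicit license is what makes such a record useful for the citation and reuse practices discussed in this webinar.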
Presenter: Christopher Erdmann, Engagement, support, and training expert on the NHLBI BioData Catalyst project at University of North Carolina Renaissance Computing Institute
dkNET Webinars Information: https://dknet.org/about/webinar
This presentation was prepared for delivery before the live overview of ChemSpider. The live presentation is captured on video here: http://www.chemspider.com/blog/my-full-presentation-at-drexel-university.html
Complexities in Open Access Discovery InterfacesMichael Habib
“It Isn’t ‘Open’ If You Can’t Find It: New Open Access Discovery Tools that Close the Gap between Readers and Open Content“, Speaker, Charleston Conference – November 9, 2017; Charleston, SC
Abstract: https://2017charlestonconference.sched.com/event/CHqR/it-isnt-open-if-you-cant-find-it-new-open-access-discovery-tools-that-close-the-gap-between-readers-and-open-content
Open PHACTS Explorer demonstration and talk given at SWAT4LS, Edinburgh, 2013. The Explorer is an Ember JS MVC web application used to navigate the Open PHACTS Linked Data Cache without requiring any knowledge of RDF or SPARQL
Online chemistry resources have expanded dramatically in the past few years, with resources such as PubChem, ChEBI, Wikipedia, ChemSpider and many others offering rich resources to scientists seeking data and information. ChemSpider has become one of the primary chemistry portals delivering a heterogeneous mix of Open and Closed data. ChemSpider offers a structure-centric community for collaboration enabling the crowd-sourced deposition and validation of online chemistry data. ChemSpider has also been integrated into the ChemMantis system – the CHEMistry Markup And Nomenclature Transformation Integrated System. This platform facilitates entity extraction of science-related terms using both heuristics and highly curated dictionaries. The resulting documents are marked up to allow viewing of chemical structures linked out to over 200 different data sources via the ChemSpider database.
The ability to query across a chemistry publisher's content using chemical structure searching can dramatically enhance discoverability. RSC has been applying a number of procedures to integrate RSC’s ChemSpider community resource with our published content and databases. These include: 1) entity extraction procedures; 2) chemical name conversion procedures using software algorithms and curated dictionaries; 3) semantic markup; and 4) crowdsourced curation processes. This presentation will provide an overview of the processes we have utilized to provide structure-based integration to RSC content. We will discuss our ongoing efforts to extend the approaches to the mining of data from the rich supplementary information sections of many RSC publications. Our intention is to provide access to synthesis procedures and analytical data and further enrich the ChemSpider database for the benefit of the chemistry community.
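The dictionary-based part of chemical name conversion can be sketched in a few lines: a curated name-to-structure dictionary is scanned against document text and each hit is resolved to a structure identifier. The toy dictionary below holds just two well-known compounds and their InChI strings; it illustrates the general lookup approach only, not the actual RSC/ChemSpider pipeline, where a production dictionary would hold millions of curated name-structure pairs.

```python
import re

# Toy curated dictionary mapping trivial chemical names to InChI strings.
NAME_TO_INCHI = {
    "benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    "caffeine": "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3",
}

def extract_entities(text):
    """Return (name, inchi) pairs for dictionary terms found in the text."""
    hits = []
    for name, inchi in NAME_TO_INCHI.items():
        # Word-boundary, case-insensitive match, so 'Benzene' is found
        # but 'nitrobenzene' is not wrongly claimed as 'benzene'.
        if re.search(r"\b%s\b" % re.escape(name), text, re.IGNORECASE):
            hits.append((name, inchi))
    return hits

matches = extract_entities("The product was recrystallised from benzene.")
```

Real systems combine such lookups with heuristics and name-to-structure conversion software to cover systematic names absent from any dictionary.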
A keynote given on experiences in curating workflows and web services.
3rd International Digital Curation Conference: "Curating our Digital Scientific Heritage: a Global Collaborative Challenge"
11-13 December 2007
Renaissance Hotel
Washington DC, USA
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr...SC CTSI at USC and CHLA
Date: Apr 4, 2018
Speakers: Hyoungjoo Park, PhD candidate, School of Information Studies, University of Wisconsin-Milwaukee, and Dietmar Wolfram, PhD
Overview: It is increasingly common for researchers to make their data freely available. This is often a requirement of funding agencies but also consistent with the principles of open science, according to which all research data should be shared and made available for reuse. Once data is reused, the researchers who have provided access to it should be acknowledged for their contributions, much as authors are recognised for their publications through citation. Hyoungjoo Park and Dietmar Wolfram have studied characteristics of data sharing, reuse, and citation and found that current data citation practices do not yet benefit data sharers, with little or no consistency in their format. More formalised citation practices might encourage more authors to make their data available for reuse.
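To make the "more formalised citation practices" point concrete, a data citation can be assembled from the core elements most citation guidelines agree on: creator, year, title, repository, and a persistent identifier. The formatting function and all values below are a hypothetical illustration, not a prescribed standard:

```python
def format_data_citation(authors, year, title, repository, doi):
    """Assemble a data citation from the core elements most guidelines
    recommend: creator(s), year, title, publisher/repository, and a
    persistent identifier (here a DOI)."""
    author_str = "; ".join(authors)
    return f"{author_str} ({year}). {title} [Data set]. {repository}. https://doi.org/{doi}"

# Hypothetical example values, for illustration only.
citation = format_data_citation(
    authors=["Doe, J.", "Roe, R."],
    year=2018,
    title="Example shared dataset",
    repository="Example Repository",
    doi="10.5555/example",
)
```

A consistent template like this, with the DOI always resolvable, is exactly the kind of formalisation the speakers argue would help data sharers receive credit.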
Workshop - finding and accessing data - Cambridge August 22 2016Fiona Nielsen
Finding and accessing human genomic data for research
University of Cambridge, United Kingdom | Seminar Room G
Monday, 22 August 2016 from 10:00 to 12:00 (BST)
Charlotte, Nadia and Fiona presented an overview of data sources around the world where you can find genomics data for your research and gave examples of the data access application for dbGaP and EGA with specific details relevant for University of Cambridge researchers.
The Diversity of Biomedical Data, Databases and Standards (Research Data Alli...Peter McQuilton
A 10-minute presentation given in Denver (CO) on the 15th September as part of the IG Elixir Bridging Force, WG Biosharing Registry, WG Data Type Registries, and WG Metadata Standards Catalog joint session of the Research Data Alliance 8th Plenary (part of International Data Week).
This presentation covers the proliferation of data, databases, and data standards in biomedicine, and how BioSharing can help inform and educate users on this landscape and relationships between data, databases and data standards.
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is, the "assets" of data, models, codes, SOPs, and workflows. The "FAIR" (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying cry. Funding agencies expect data (and increasingly software) management, retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post-publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training and create a platform for dataset interoperability. As the Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform, I will show how this work relates to your projects.
[1] Wilkinson et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3 (2016), doi:10.1038/sdata.2016.18
How to make your published data findable, accessible, interoperable and reusablePhoenix Bioinformatics
Seminar Presentation for PMB Department, UC Berkeley for Love Data Week. Subject is how to prepare publications and associated data sets for maximum reuse.
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET
Abstract
In this presentation, Susan Gregurick, Ph.D., Associate Director of Data Science and Director, Office of Data Science Strategy at the National Institutes of Health, will share the NIH’s vision for a modernized, integrated FAIR biomedical data ecosystem and the strategic roadmap that NIH is following to achieve this vision. Dr. Gregurick will highlight projects being implemented by team members across the NIH’s 27 institutes and centers and will discuss ways that industry, academia, and other communities can help NIH enable a FAIR data ecosystem. Finally, she will weave in how this strategy is being leveraged to address the COVID-19 pandemic.
Presenter: Susan Gregurick, Ph.D., Associate Director of Data Science and Director, Office of Data Science Strategy at the National Institutes of Health
dkNET Webinar Information: https://dknet.org/about/webinar
Lecture for a course at NTNU, 27th January 2021
CC-BY 4.0 Dag Endresen https://orcid.org/0000-0002-2352-5497
See also http://bit.ly/biodiversityinformatics
https://www.gbif.no/events/2021/lecture-ntnu-gbif.html
BioPharma and FAIR Data, a Collaborative AdvantageTom Plasterer
The concept of FAIR (Findable, Accessible, Interoperable and Reusable) data is becoming a reality as stakeholders from industry, academia, funding agencies and publishers are embracing this approach. For BioPharma being able to effectively share and reuse data is a tremendous competitive advantage, within a company, with peer organizations, key opinion leaders and regulatory agencies. A few key drivers, success stories and preliminary results of an industry data stewardship survey are presented.
Laurie Goodman at #aibsdata: Beyond Data Release Mandates - Helping Authors M...GigaScience, BGI Hong Kong
Laurie Goodman at the AIBS Changing Practices in Data Pub workshop: Beyond Data Release Mandates - Helping Authors Make Data Available. 3rd December 2014
Access the webinar: http://goo.gl/p08pTz
These slides were presented in a webinar by Denodo in collaboration with BioStorage Technologies and Indiana Clinical and Translational Sciences Institute and Regenstrief Institute.
BioStorage Technologies, Inc., Indiana Clinical and Translational Sciences Institute, and Regenstrief Institute (CTSI) have joined Denodo to talk about the important role of technological advancements, such as data virtualization, in advancing biospecimen research.
By watching this webinar, you can gain insight into best practices around the integration of biospecimen and research data as well as technology solutions that provide consolidated views and rapid conversions of this data into valuable business insights. You will also learn how data virtualization can assist with the integration of data residing in heterogeneous repositories and can securely deliver aggregated data in real-time.
Data Communities - reusable data in and outside your organization.Paul Groth
Description
Data is critical both to the functioning of an organization and as a product. How can you make that data more usable for both internal and external stakeholders? There is a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use. It can be overwhelming. Based on recent empirical work (analyzing data-reuse proxies at scale, understanding data sensemaking, and looking at how researchers search for data), I talk about which practices are a good place to start for helping others to reuse your data. I put this in the context of the notion of data communities, which organizations can use to help foster the use of data both internally and externally.
Similar to "Measuring richness: An RCT to quantify the benefits of metadata quality" (Scott Edmunds)
IDW2022: A decade's experience in transparent and interactive publication of ...GigaScience, BGI Hong Kong
Scott Edmunds at International Data Week 2022: A decade's experience in transparent and interactive publication of FAIR data and software via an end-to-end XML publishing platform. 21st June 2022
GigaByte Chief Editor Scott Edmunds presents on how to prepare a data paper for the TDR- and WHO-sponsored call for data papers describing datasets on vectors of human diseases, launched in November 2021. Presented at the GBIF webinar on 25th January 2022 and aimed at authors interested in submitting a manuscript to the series.
STM Week: Demonstrating bringing publications to life via an End-to-end XML p...GigaScience, BGI Hong Kong
Scott Edmunds at the STM Week 2020 Digital Publishing seminar on Demonstrating bringing publications to life via an End-to-end XML publishing platform. 2nd December 2020
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...GigaScience, BGI Hong Kong
Scott Edmunds' talk at CODATA2019 on Quantifying how FAIR is Hong Kong: The Hong Kong Shareability of Hong Kong University Research Experiment. 19th September 2019 in Beijing
Scott Edmunds talk at IARC: How can we make science more trustworthy and FAIR...GigaScience, BGI Hong Kong
Scott Edmunds talk at IARC, Lyon. How can we make science more trustworthy and FAIR? Principled publishing for more evidence based research. 8th July 2019
PAGAsia19 - The Digitalization of Ruili Botanical Garden Project: Production...GigaScience, BGI Hong Kong
A 3-part talk presented at PAG Asia 2019 in Shenzhen: The Digitalization of Ruili Botanical Garden Project: Production, Curation and Re-Use. Presented by Huan Liu (CNGB), Scott Edmunds (GigaScience) & Stephen Tsui (CUHK). 8th June 2019
Democratising biodiversity and genomics research: open and citizen science to...GigaScience, BGI Hong Kong
Scott Edmunds at the China National GeneBank Youth Biodiversity MegaData Forum: Democratising biodiversity and genomics research: open and citizen science to build trust and fill the data gaps. 18th December 2018
Ricardo Wurmus at #ICG13: Reproducible genomics analysis pipelines with GNU Guix. Presented at the GigaScience Prize Track at the International Conference on Genomics, Shenzhen, 26th October 2018
Paul Pavlidis at #ICG13: Monitoring changes in the Gene Ontology and their im...GigaScience, BGI Hong Kong
Paul Pavlidis talk at the #ICG13 GigaScience Prize Track: Monitoring changes in the Gene Ontology and their impact on genomic data analysis (GOtrack). Shenzhen, 26th October 2018
Stefan Prost at #ICG13: Genome analyses show strong selection on coloration, ...GigaScience, BGI Hong Kong
Stefan Prost presentation for the #ICG13 GigaScience Prize Track: Genome analyses show strong selection on coloration, morphological and behavioral phenotypes in birds-of-paradise. Shenzhen, 26th October, 2018
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 67...GigaScience, BGI Hong Kong
Lisa Johnson's talk at the #ICG13 GigaScience Prize Track: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Shenzhen, 26th October 2018
Reproducible method and benchmarking publishing for the data (and evidence) d...GigaScience, BGI Hong Kong
Scott Edmunds presentation on: Reproducible method and benchmarking publishing for the data (and evidence) driven era. The Silk Road Forensics Conference, Yantai, 18th September 2018
Mary Ann Tuli: What MODs can learn from Journals – a GigaDB curator’s perspec...GigaScience, BGI Hong Kong
Mary Ann Tuli's talk at the International Society of Biocuration meeting: What MODs can learn from Journals – a GigaDB curator's perspective. Shanghai, 9th April 2018
Laurie Goodman: Sharing and Reusing Cell Image Data, ASCB/EMBO 2017 Subgroup ...GigaScience, BGI Hong Kong
Laurie Goodman's pre-prepared slides for the Subgroup S Sharing and Reusing Cell Image Data session at the 2017 ASCB│EMBO meeting in Philadelphia. December 2017
Susanna Sansone's talk at the "Beyond Open" Knowledge Dialogues/Open Data Hong Kong event on research data, hosted at the Hong Kong Innocentre on Monday 20 November 2017.
Jie Zheng at #ICG12: PhenoSpD: an atlas of phenotypic correlations and a mult...GigaScience, BGI Hong Kong
Jie Zheng at the #ICG12 GigaScience Prize Track: PhenoSpD: an atlas of phenotypic correlations and a multiple testing correction for the human phenome. ICG12, Shenzhen, 26th October 2017
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Measuring richness. A RCT to quantify the benefits of metadata quality; Scott Edmunds
1. Measuring richness: an RCT to quantify the benefits of metadata quality
Scott Edmunds
DataCite APAC 2020
2. 8 years in numbers:
• 765 papers published by 5,411 authors from 78 countries and 1,575 institutions
• We've published 46 TB of data: 1,914 datasets/DataCite DOIs, 327,658 files
• 5 editors in 5 locations across 16 time zones, with 68+ years of editorial experience
• 3 data curators with 65+ years of experience
• 231 Data Notes
• Our content is used in 20+ patents and policy documents, 1,000+ news articles and blogs, and 21,186+ tweets
• 1 PROSE Award for innovation in multidisciplinary journal publishing
3. Incentivising data sharing through data publication
4. Where data citation was in 2012:
1. Proven utility/potential user base ✔
2. Acceptance/inclusion by journals ✔
3. Data + citation: inclusion in the references ✔
4. Tracking by citation indexes ✗
5. Usage of the metrics by the community… ✗
5. Where is data citation in 2020?
We still need to tell people to #CitetheDOI
6. What we didn't know in 2012: #DataCitationFail
e.g. this paper:
Gioiosa S, Bolis M, Flati T, Massini A, Garattini E, Chillemi G, Fratelli M, Castrignanò T. Massive NGS data analysis reveals hundreds of potential novel gene fusions in human cell lines. GigaScience. 2018 Oct 1;7(10). https://doi.org/10.1093/gigascience/giy062
It cites this GigaDB dataset DOI in its references:
Gioiosa S, Bolis M, Flati T, et al. Supporting data for "Massive NGS data analysis reveals hundreds of potential novel gene fusions in human cell lines." GigaScience Database. 2018. http://dx.doi.org/10.5524/100442
But the paper includes no dataset DOI information in its Crossref metadata (stripped?); see ref 37 in:
https://api.crossref.org/v1/works/doi.org/10.1093/gigascience/giy062
As a result, no citations show up in Event Data:
https://api.datacite.org/events?doi=10.5524/100442
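The Event Data check above can be scripted. A minimal sketch that builds the query URL for a dataset DOI and counts the events in a response; the sample response is hard-coded to mirror the slide's finding, and reading the count from `meta.total` is an assumption about the response shape:

```python
# Sketch: check whether a dataset DOI has any links recorded in DataCite
# Event Data (endpoint as shown on the slide).
import json
from urllib.parse import urlencode

DATACITE_EVENTS = "https://api.datacite.org/events"

def events_query_url(doi):
    """Build the Event Data query URL for a dataset DOI."""
    return DATACITE_EVENTS + "?" + urlencode({"doi": doi})

def count_events(response_json):
    """Count events in an Event Data response (assumes a meta.total field)."""
    payload = json.loads(response_json)
    return payload.get("meta", {}).get("total", len(payload.get("data", [])))

# The GigaDB dataset from the slide, which had no recorded events:
url = events_query_url("10.5524/100442")
empty_response = '{"data": [], "meta": {"total": 0}}'
print(url)                           # → https://api.datacite.org/events?doi=10.5524%2F100442
print(count_events(empty_response))  # → 0
```

In a real check the response would come from an HTTP GET against the built URL; the slide's point is that for this DOI the result set was empty.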
7. Where is metadata in 2020?
1. Focus now on the move from open to FAIR data (metadata for reusability)
2. Data journals helping incentivize best practice
3. Google Dataset Search pushing the value of schema.org (structured metadata for discoverability)
4. Event Data & Scholix pushing the value of non-proprietary (DataCite/Crossref) citation data
5. New indexes, knowledge graphs and tools built upon these richer data sources
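To illustrate point 3, here is a minimal, hypothetical schema.org Dataset record of the kind Google Dataset Search can index, built as JSON-LD; the field values (name, author, date) are illustrative, not the real metadata of any GigaDB record:

```python
# Hypothetical schema.org Dataset description in JSON-LD; values are
# placeholders, only the DOI echoes the one discussed on the earlier slide.
import json

dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Supporting data for an example genomics paper",
    "identifier": "https://doi.org/10.5524/100442",
    "creator": [{"@type": "Person", "name": "Example Author"}],
    "publisher": {"@type": "Organization", "name": "GigaScience Database"},
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "description": "Illustrative structured metadata for discoverability.",
}

# Serialise to the JSON-LD string that would be embedded in a landing page.
print(json.dumps(dataset_jsonld, indent=2))
```

Embedding a block like this in a dataset landing page is what makes the record discoverable to crawlers, independent of any repository-specific API.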
8. Huge potential, but are data producers using/following it?
http://www.metadata2020.org/
9. GigaScience: adding value (work)
Minimal DataCite (discoverability): title, author names, publisher details, release date, resource type, language.
Additional DataCite (reusability + discoverability): ORCID iDs, keywords, funder details, size of dataset, license, description, relationship info.
Dataset-specific (reusability): reporting checklist attributes, location, specimen details, phenotypic info, related accessions.
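As a sketch of the gap between the two DataCite tiers above, here is a hypothetical pair of records with placeholder values; the field names loosely follow the slide, not the exact DataCite schema:

```python
# Contrast a minimal DataCite record with an enriched one; all values are
# placeholders, field names follow the slide's two columns.
minimal = {
    "title": "Example supporting dataset",
    "creators": ["Author, A."],
    "publisher": "GigaScience Database",
    "publicationYear": 2020,
    "resourceType": "Dataset",
    "language": "en",
}

enriched = dict(
    minimal,
    creatorOrcids=["0000-0000-0000-0000"],            # placeholder ORCID iD
    subjects=["genomics", "biodiversity"],            # keywords
    fundingReferences=["Example Funder"],
    sizes=["54 TB"],
    rightsList=["CC0 1.0"],
    descriptions=["Abstract-style description of the dataset."],
    relatedIdentifiers=["10.1093/gigascience/giy062"],  # relationship info
)

# The enrichment work is exactly the fields the minimal record lacks:
print(sorted(set(enriched) - set(minimal)))
```

The point of the slide is that each extra field is curation effort, which is what the RCT on the following slides tries to justify.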
11. Is this worth the effort?
Follow the medical community approach: a Randomized Control Trial?
Pyramid of evidence (top to bottom): RCTs; cohort studies; case-control studies; case reports, qualitative research.
12. Finding an example to study: DRBG, the "Digitization of Ruili Botanical Garden"
• 1st phase, proof of concept for 10KP
• The 1st digitalized botanical garden
• Shows the biodiversity and phyletic evolution, and interactions between environment, ecosystem and evolution
• HT species identification & building the CNGB Herbarium
• Results of phase 1 published in GigaScience
1,093 samples; 1,093 voucher specimens; 49 orders; 137 families; 761 deep-sequenced; 689 vascular species; 54 TB of data
16. Does rich metadata increase discoverability? Testing with RCT
https://osf.io/wzps8/
17. Does rich metadata increase discoverability? Testing with RCT
https://osf.io/wzps8/
Cohorts (assigned at random with =RANDBETWEEN):
High data content (HDC set, n=17):
• HDC1 – high data content, full DataCite metadata, n=8
• HDC2 – high data content, minimal DataCite metadata, n=9
Low data content (LDC set, n=1,076):
• LDC1 – low data content, full DataCite metadata, n=545
• LDC2 – low data content, minimal DataCite metadata, n=531
Rich metadata set vs poor metadata set; wait 12 months.
Any difference in metrics? (visits, downloads, citations…)
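The randomisation step can be sketched in code; the slide used Excel's =RANDBETWEEN, and here Python's `random` module plays the same role. The dataset IDs and seed are illustrative, only the cohort size (n=17 for the high-data-content set) comes from the slide:

```python
# Sketch of randomising datasets into rich- vs minimal-metadata arms,
# mirroring the =RANDBETWEEN step on the slide (IDs and seed are made up).
import random

def randomise(dataset_ids, seed=None):
    """Split dataset IDs into rich- and minimal-metadata arms at random."""
    rng = random.Random(seed)
    rich, minimal = [], []
    for ds in dataset_ids:
        # Per-dataset coin flip, equivalent to RANDBETWEEN(0, 1) in Excel.
        (rich if rng.randint(0, 1) else minimal).append(ds)
    return rich, minimal

# High-data-content cohort: n=17 on the slide (8 ended up rich, 9 minimal).
hdc = [f"dataset-{i:03d}" for i in range(17)]
rich, minimal = randomise(hdc, seed=42)
assert len(rich) + len(minimal) == len(hdc)
assert set(rich).isdisjoint(minimal)
```

A per-dataset coin flip (rather than a fixed 50/50 split) explains why the arms on the slide are slightly unequal, e.g. 545 vs 531 in the low-data-content cohort.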
18. Does rich metadata increase discoverability? Testing with RCT
Any difference in metrics? (visits, downloads, citations…)
• The total number of unique page views for ALL 1,093 individual Ruili datasets is 504 over the year (0.46 views per dataset)
• Equivalent datasets (individual genomes from bird & orphan crop genome projects) that are NOT Ruili datasets received 4,473 unique page hits over the same period (44.7/dataset)
• Rich-metadata datasets received on average 0.438 hits/dataset/year
• Poor-metadata datasets received on average 0.485 hits/dataset/year
✗ FAIL: didn't work/underpowered (very low access stats)
19. Does rich metadata increase discoverability? Testing with RCT
✗ FAIL. Lessons learned for future RCTs:
• Unidentified species are not a great use case for discoverability
• A quick-and-dirty approach to an RCT doesn't work: need a wider spectrum of more popular datasets and a bigger sample size
• Comparing historical usage is tricky; comparison groups need better matching, and datasets ideally need to be released at the same time to account for calendar differences and usage spikes
• Need to test with databases with higher access/turnover, which could be done with hundreds of random generic datasets published within a short timeframe and randomly assigned into minimal vs enhanced metadata groups
• Crossref RCTs would probably work better than DataCite ones (more users)
20. Does rich metadata increase discoverability? Testing with RCT
See our experiment: https://osf.io/wzps8/
TO DO: METADATA EXPERIMENTS
21. Thanks to:
Laurie Goodman, Publisher
Nicole Nogoy, Editor
Hans Zauner, Assistant Editor
Hongling Zhao, Assistant Editor
Peter Li, Head of IT
Chris Hunter, Lead BioCurator
Chris Armit, Data Scientist
Mary Ann Tuli, Data Editor
Rija Ménagé, Senior Software Engineer
Ken Cho, Systems Programmer Analyst
Chen Qi, Shenzhen Office.
Jesse Xiao (now at HKU)
Follow us:
https://gigabytejournal.com/
Submit to our new GigaByte Journal, free APCs till 28th Feb 2021
@GigaByteJournal
facebook.com/GigaByteJournal
http://gigasciencejournal.com/blog/
editorial@gigabytejournal.com
Editor's Notes
Includes sample metadata (in database only, not DataCite) and cross-species results (gene alignments & trees)