Seminar presentation for the PMB Department, UC Berkeley, for Love Data Week. The subject is how to prepare publications and associated data sets for maximum reuse.
The Center for Expanded Data Annotation and Retrieval (CEDAR) has developed a suite of tools and services that allow scientists to create and publish metadata describing scientific experiments. Using these tools and services—referred to collectively as the CEDAR Workbench—scientists can collaboratively author metadata and submit them to public repositories. A key focus of our software is semantically enriching metadata with ontology terms. The system combines emerging technologies, such as JSON-LD and graph databases, with modern software development technologies, such as microservices and container platforms. The result is a suite of user-friendly, Web-based tools and REST APIs that provide a versatile end-to-end solution to the problems of metadata authoring and management. This talk presents the architecture of the CEDAR Workbench and focuses on the technology choices made to construct an easily usable, open system that allows users to create and publish semantically enriched metadata in standard Web formats.
The metadata about scientific experiments are crucial for finding, reproducing, and reusing the data that the metadata describe. We present a study of the quality of the metadata stored in BioSample—a repository of metadata about samples used in biomedical experiments managed by the U.S. National Center for Biotechnology Information (NCBI). We tested whether 6.6 million BioSample metadata records are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the analyzed metadata. The BioSample metadata field names and their values are not standardized or controlled—15% of the metadata fields use field names not specified in the BioSample data dictionary. Only 9 out of 452 BioSample-specified fields ordinarily require ontology terms as values, and the quality of these controlled fields is better than that of uncontrolled ones, as even simple binary or numeric fields are often populated with inadequate values of different data types (e.g., only 27% of Boolean values are valid). Overall, the metadata in BioSample reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The aberrancies in the metadata are likely to impede search and secondary use of the associated datasets.
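The validation problem described above can be made concrete with a small sketch. The field names and typing rules below are invented for illustration; they are not the actual BioSample data dictionary:

```python
# Sketch of the kind of check the study performs: validate metadata field
# values against a declared data dictionary. The fields and rules here
# are illustrative, not the actual BioSample dictionary.

VALID_BOOLEANS = {"true", "false", "yes", "no"}

data_dictionary = {
    "collection_date": "date",   # expected ISO 8601, not checked here
    "host_age": "integer",
    "is_pooled": "boolean",
}

def validate_record(record):
    """Return a list of (field, problem) tuples for one metadata record."""
    problems = []
    for field, value in record.items():
        expected = data_dictionary.get(field)
        if expected is None:
            problems.append((field, "field not in data dictionary"))
        elif expected == "boolean" and value.strip().lower() not in VALID_BOOLEANS:
            problems.append((field, "invalid boolean value"))
        elif expected == "integer":
            try:
                int(value)
            except ValueError:
                problems.append((field, "invalid integer value"))
    return problems

record = {"is_pooled": "N/A", "host_age": "five", "sample_color": "red"}
```

Running `validate_record` on the sample record flags an undeclared field name, a malformed integer, and an invalid Boolean, the same classes of anomaly the study reports at scale.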
The availability of high-quality metadata is key to facilitating discovery in the large variety of scientific datasets that are increasingly becoming publicly available. However, despite the recent focus on metadata, the diversity of metadata representation formats and the poor support for semantic markup typically result in metadata that are of poor quality. There is a pressing need for a metadata representation format that provides strong interoperation capabilities together with robust semantic underpinnings. In this talk, we describe such a format, together with open-source Web-based tools that support the acquisition, search, and management of metadata. We outline an initial evaluation using metadata from a variety of biomedical repositories.
The Center for Expanded Data Annotation and Retrieval (CEDAR) aims to revolutionize the way that metadata describing scientific experiments are authored. The software we have developed, the CEDAR Workbench, is a suite of Web-based tools and REST APIs that allows users to construct metadata templates, to fill in templates to generate high-quality metadata, and to share and manage these resources. The CEDAR Workbench provides a versatile, REST-based environment for authoring metadata that are enriched with terms from ontologies. The metadata are available as JSON, JSON-LD, or RDF for easy integration in scientific applications and reusability on the Web. Users can leverage our APIs for validating and submitting metadata to external repositories. The CEDAR Workbench is freely available and open-source.
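As a rough illustration of what ontology-enriched metadata in JSON-LD looks like (the template field and context IRIs here are hypothetical, not CEDAR's actual schema):

```python
import json

# A minimal JSON-LD metadata record in which a field value is paired with
# an ontology term IRI and a human-readable label. The context and field
# names are illustrative, not CEDAR's actual template schema.
metadata = {
    "@context": {
        "organism": "https://example.org/schema/organism",
        "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    },
    "organism": {
        "@id": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
        "rdfs:label": "Homo sapiens",
    },
}

print(json.dumps(metadata, indent=2))
```

Because the value carries a resolvable IRI rather than only free text, downstream tools can interpret it unambiguously as an ontology term.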
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe... (Carole Goble)
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is, the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management, retention, and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and code for pre- and post-publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) have been an effective rallying cry for EU and USA research infrastructures. The FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (de.NBI, the German Virtual Liver Network, UK SynBio centres) and PIs' labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Systems Biology, highlighting the challenges of, and approaches to, sharing, credit, citation, and asset infrastructures in practice. I'll also highlight recent experiments in encouraging sharing through behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
Connecting the dots: drug information and Linked Data (Tomasz Adamusiak)
Presented as part of the AMIA 2014 Knowledge Representation + Semantics and Clinical Information Systems Working Groups Pre-Symposium "Drug Terminology Standards: Meaningful Use and Better Knowledge", November 16, 2014, Washington, DC.
What is Reproducibility? The R* brouhaha (and how Research Objects can help) (Carole Goble)
Presented at the 1st International Workshop on Reproducible Open Science @ TPDL, 9 September 2016, Hannover, Germany.
http://repscience2016.research-infrastructures.eu/
Facilitating semantic alignment of EMBL-EBI services using ontologies and semantic web technology. Presentation at the BioHackathon Symposium 2016, Japan.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article:
Carole Goble, "Better Software, Better Research," IEEE Internet Computing, vol. 18, no. 5 (Sept.-Oct. 2014), pp. 4-8, IEEE Computer Society.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
With its focus on investigating the basis for the sustained existence of living systems, modern biology has always been a fertile, if challenging, domain for formal knowledge representation and automated reasoning. With thousands of databases and hundreds of ontologies now available, there is a salient opportunity to integrate these for discovery. In this talk, I will discuss our efforts to build a rich foundational network of ontology-annotated linked data, develop methods to intelligently retrieve content of interest, uncover significant biological associations, and pursue new avenues for drug discovery. As the portfolio of Semantic Web technologies continues to mature in terms of functionality, scalability, and an understanding of how to maximize their value, researchers will be strategically poised to pursue increasingly sophisticated KR projects aimed at improving our overall understanding of human health and disease.
Bio: Dr. Michel Dumontier is an Associate Professor of Medicine (Biomedical Informatics) at Stanford University. His research aims to find new treatments for rare and complex diseases. His research interests lie in the publication, integration, and discovery of scientific knowledge. Dr. Dumontier serves as a co-chair for the World Wide Web Consortium Semantic Web in Health Care and Life Sciences Interest Group (W3C HCLSIG) and is the Scientific Director for Bio2RDF, a widely used open-source project to create and provide linked data for the life sciences.
Model organisms such as budding yeast provide a common platform to interrogate and understand cellular and physiological processes. Knowledge about model organisms, whether generated during the course of scientific investigation or extracted from published articles, is made available by model organism databases (MODs) such as the Saccharomyces Genome Database (SGD) for powerful, data-driven bioinformatic analyses. Integrative platforms such as InterMine offer a standard platform for MOD data exploration and data mining. Yet today's bioinformatic analyses also require access to a significantly broader set of structured biomedical data, such as what can be found in the emerging network of Linked Open Data (LOD). If MOD data could be provisioned as FAIR (Findable, Accessible, Interoperable, and Reusable), then scientists could leverage a greater amount of interoperable data in knowledge discovery.
The goal of this proposal is to increase the utility of MOD data by implementing standards-compliant data access interfaces that interoperate with Linked Data. We will focus our efforts on developing interfaces for data access, data retrieval, and query answering for SGD. Our software will publish InterMine data as LOD that are semantically annotated with ontologies and can be retrieved using standardized formats (e.g., JSON-LD, Turtle). We will facilitate the exploration of MOD data for hypothesis testing by implementing efficient query answering using Linked Data Fragments, and by developing a set of graphical user interfaces to search for data of interest, explore connections, and answer questions that leverage the wider LOD network. Finally, we will develop a locally and cloud-deployable image to enable the rapid deployment of the proposed infrastructure. Our efforts to increase interoperability and ease of deployment for biomedical data repositories will increase research productivity and reduce costs associated with data integration and warehouse maintenance.
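Linked Data Fragments, mentioned above, answer queries by letting a server expose only simple triple-pattern matching while clients combine the fragments. A minimal sketch of the server-side operation, over toy triples with invented identifiers (not SGD's actual IRIs):

```python
# A toy in-memory triple store and the triple-pattern matcher at the core
# of a Linked Data Fragments server. Subjects, predicates, and objects
# here are invented abbreviations, not real SGD identifiers.
TRIPLES = [
    ("ex:YFL039C", "ex:standardName", "ACT1"),
    ("ex:YFL039C", "ex:organism", "taxon:559292"),
    ("ex:YGR192C", "ex:standardName", "TDH3"),
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None is a wildcard."""
    return [
        (s, p, o)
        for s, p, o in TRIPLES
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Fetch the fragment for the pattern (?s, ex:standardName, ?o):
names = match(predicate="ex:standardName")
```

A client answers richer queries by requesting several such fragments and joining them locally, which keeps the server inexpensive to host.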
The scientific and economic value of research data is enormous. To ensure successful subsequent usage, the scientific community needs efficient access to the data, the data must be reliable and persistent, and the quality of the data has to be demonstrable.
One solution to these preconditions is to apply the techniques of today’s scientific publishing to research data. Besides its publication in a data repository together with some metadata, the data should undergo a transparent public peer-review using a publication platform.
The presentation discusses two approaches. On the one hand, the data can be the basis for a research article and undergoes a review parallel to the review of the manuscript. The data is then a reviewed supplement to a scientific publication. On the other hand, the data itself can be the subject of a publication whose quality is then assured by peers.
The presentation provides practical experience, especially with the latter strategy, realized through an established open access journal.
Information retrieval (IR) is the retrieval of items (objects, Web pages, documents, and so forth) that satisfy explicit conditions expressed in a formal statement such as a query. While IR aims to satisfy a user's information need, generally expressed in natural language, data retrieval aims to determine which records contain the exact terms of the user's query.
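A toy sketch makes the contrast concrete (three invented one-line documents, deliberately simplistic scoring):

```python
docs = {
    1: "metadata quality in public repositories",
    2: "ontology terms improve metadata search",
    3: "repositories of biological samples",
}

def data_retrieval(term):
    """Data retrieval: return IDs of documents containing the exact term."""
    return [doc_id for doc_id, text in docs.items() if term in text.split()]

def information_retrieval(query):
    """IR: rank documents by word overlap with the query (crude relevance)."""
    words = set(query.split())
    scored = [(len(words & set(text.split())), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]
```

Data retrieval either matches a record or it doesn't; information retrieval orders partially relevant documents, which is why it better serves needs expressed in natural language.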
As BioPharma adapts to incorporate nimble networks of suppliers, collaborators, and regulators, the ability to link data is critical for dynamic interoperability. Adoption of the linked data paradigm allows BioPharma to focus on its core business: delivering valuable therapeutics in a timely manner.
Being FAIR: FAIR data and model management, SSBSS 2017 Summer School (Carole Goble)
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face, ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE), as well as in PIs' labs and centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also discuss the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the life science community, improve access to them and to related tools, support training, and create a platform for dataset interoperability. As Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform, I will show how this work relates to your projects.
[1] Wilkinson et al., "The FAIR Guiding Principles for scientific data management and stewardship," Scientific Data 3 (2016), doi:10.1038/sdata.2016.18
This session covers topics related to data archiving and sharing. This includes data formats, metadata, controlled vocabularies, preservation, archiving and repositories.
Research Data Sharing and Re-Use: Practical Implications for Data Citation Pr... (SC CTSI at USC and CHLA)
Date: Apr 4, 2018
Speaker: Hyoungjoo Park, PhD candidate, School of Information Studies, University of Wisconsin-Milwaukee, and Dietmar Wolfram, PhD
Overview: It is increasingly common for researchers to make their data freely available. This is often a requirement of funding agencies but also consistent with the principles of open science, according to which all research data should be shared and made available for reuse. Once data is reused, the researchers who have provided access to it should be acknowledged for their contributions, much as authors are recognised for their publications through citation. Hyoungjoo Park and Dietmar Wolfram have studied characteristics of data sharing, reuse, and citation and found that current data citation practices do not yet benefit data sharers, with little or no consistency in their format. More formalised citation practices might encourage more authors to make their data available for reuse.
The challenge of sharing data well: how publishers can help (Varsha Khodiyar)
Researchers, academic institutes and funders are increasingly recognizing the importance of data sharing for reproducible science. However, it is not always clear to researchers how best to share data in a useful way. At Springer Nature we are working on several initiatives to help facilitate the sharing of research data in a reusable way, with our overarching goal being to publish research that is robust and reproducible. I will talk about the effort that goes into our flagship data journal, Scientific Data, to facilitate best practices in the publication and sharing of research data, and share some of our experiences publishing Challenge datasets. I will also describe some of the newer Research Data Services that are now available to help all researchers (not only Springer Nature authors) to share their data in a useful way.
A Generic Scientific Data Model and Ontology for Representation of Chemical DataStuart Chalk
The current movement toward openness and sharing of data is likely to have a profound effect on the speed of scientific research and the complexity of questions we can answer. However, a fundamental problem with currently available datasets (and their metadata) is heterogeneity in terms of implementation, organization, and representation.
To address this issue we have developed a generic scientific data model (SDM) to organize and annotate raw and processed data, and the associated metadata. This paper will present the current status of the SDM, implementation of the SDM in JSON-LD, and the associated scientific data model ontology (SDMO). Example usage of the SDM to store data from a variety of sources with be discussed along with future plans for the work.
Case Study Life Sciences Data: Central for Integrative Systems Biology and Bi...sesrdm
Presentation by Dr Sarah Butcher, Imperial College London at Science and Engineering South (SES) Event - Helping Researchers Manage their Data - Friday 9th May 2014 held at Imperial College London
Citing data in research articles: principles, implementation, challenges - an...FAIRDOM
Prepared and presented by Jo McEntyre (EMBL_EBI) as part of the Reproducible and Citable Data and Models Workshop in Warnemünde, Germany. September 14th - 16th 2015.
Similar to How to make your published data findable, accessible, interoperable and reusable (20)
Introduction to an online resource that displays pre-computed phylogenetic trees of gene families alongside experimental gene function data to facilitate inference of unknown gene function in plants. From the same team that brings you TAIR (The Arabidopsis Information Resource)
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...Levi Shapiro
Letter from the Congress of the United States regarding Anti-Semitism sent June 3rd to MIT President Sally Kornbluth, MIT Corp Chair, Mark Gorenberg
Dear Dr. Kornbluth and Mr. Gorenberg,
The US House of Representatives is deeply concerned by ongoing and pervasive acts of antisemitic
harassment and intimidation at the Massachusetts Institute of Technology (MIT). Failing to act decisively to ensure a safe learning environment for all students would be a grave dereliction of your responsibilities as President of MIT and Chair of the MIT Corporation.
This Congress will not stand idly by and allow an environment hostile to Jewish students to persist. The House believes that your institution is in violation of Title VI of the Civil Rights Act, and the inability or
unwillingness to rectify this violation through action requires accountability.
Postsecondary education is a unique opportunity for students to learn and have their ideas and beliefs challenged. However, universities receiving hundreds of millions of federal funds annually have denied
students that opportunity and have been hijacked to become venues for the promotion of terrorism, antisemitic harassment and intimidation, unlawful encampments, and in some cases, assaults and riots.
The House of Representatives will not countenance the use of federal funds to indoctrinate students into hateful, antisemitic, anti-American supporters of terrorism. Investigations into campus antisemitism by the Committee on Education and the Workforce and the Committee on Ways and Means have been expanded into a Congress-wide probe across all relevant jurisdictions to address this national crisis. The undersigned Committees will conduct oversight into the use of federal funds at MIT and its learning environment under authorities granted to each Committee.
• The Committee on Education and the Workforce has been investigating your institution since December 7, 2023. The Committee has broad jurisdiction over postsecondary education, including its compliance with Title VI of the Civil Rights Act, campus safety concerns over disruptions to the learning environment, and the awarding of federal student aid under the Higher Education Act.
• The Committee on Oversight and Accountability is investigating the sources of funding and other support flowing to groups espousing pro-Hamas propaganda and engaged in antisemitic harassment and intimidation of students. The Committee on Oversight and Accountability is the principal oversight committee of the US House of Representatives and has broad authority to investigate “any matter” at “any time” under House Rule X.
• The Committee on Ways and Means has been investigating several universities since November 15, 2023, when the Committee held a hearing entitled From Ivory Towers to Dark Corners: Investigating the Nexus Between Antisemitism, Tax-Exempt Universities, and Terror Financing. The Committee followed the hearing with letters to those institutions on January 10, 202
Read| The latest issue of The Challenger is here! We are thrilled to announce that our school paper has qualified for the NATIONAL SCHOOLS PRESS CONFERENCE (NSPC) 2024. Thank you for your unwavering support and trust. Dive into the stories that made us stand out!
Unit 8 - Information and Communication Technology (Paper I).pdfThiyagu K
This slides describes the basic concepts of ICT, basics of Email, Emerging Technology and Digital Initiatives in Education. This presentations aligns with the UGC Paper I syllabus.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
A Strategic Approach: GenAI in EducationPeter Windle
Artificial Intelligence (AI) technologies such as Generative AI, Image Generators and Large Language Models have had a dramatic impact on teaching, learning and assessment over the past 18 months. The most immediate threat AI posed was to Academic Integrity with Higher Education Institutes (HEIs) focusing their efforts on combating the use of GenAI in assessment. Guidelines were developed for staff and students, policies put in place too. Innovative educators have forged paths in the use of Generative AI for teaching, learning and assessments leading to pockets of transformation springing up across HEIs, often with little or no top-down guidance, support or direction.
This Gasta posits a strategic approach to integrating AI into HEIs to prepare staff, students and the curriculum for an evolving world and workplace. We will highlight the advantages of working with these technologies beyond the realm of teaching, learning and assessment by considering prompt engineering skills, industry impact, curriculum changes, and the need for staff upskilling. In contrast, not engaging strategically with Generative AI poses risks, including falling behind peers, missed opportunities and failing to ensure our graduates remain employable. The rapid evolution of AI technologies necessitates a proactive and strategic approach if we are to remain relevant.
Biological screening of herbal drugs: Introduction and Need for
Phyto-Pharmacological Screening, New Strategies for evaluating
Natural Products, In vitro evaluation techniques for Antioxidants, Antimicrobial and Anticancer drugs. In vivo evaluation techniques
for Anti-inflammatory, Antiulcer, Anticancer, Wound healing, Antidiabetic, Hepatoprotective, Cardio protective, Diuretics and
Antifertility, Toxicity studies as per OECD guidelines
Francesca Gottschalk - How can education support child empowerment.pptxEduSkills OECD
Francesca Gottschalk from the OECD’s Centre for Educational Research and Innovation presents at the Ask an Expert Webinar: How can education support child empowerment?
Synthetic Fiber Construction in lab .pptxPavel ( NSTU)
Synthetic fiber production is a fascinating and complex field that blends chemistry, engineering, and environmental science. By understanding these aspects, students can gain a comprehensive view of synthetic fiber production, its impact on society and the environment, and the potential for future innovations. Synthetic fibers play a crucial role in modern society, impacting various aspects of daily life, industry, and the environment. ynthetic fibers are integral to modern life, offering a range of benefits from cost-effectiveness and versatility to innovative applications and performance characteristics. While they pose environmental challenges, ongoing research and development aim to create more sustainable and eco-friendly alternatives. Understanding the importance of synthetic fibers helps in appreciating their role in the economy, industry, and daily life, while also emphasizing the need for sustainable practices and innovation.
2. Good Data Stewardship
• Publish data with the paper
• Describe data to your fullest ability
• Use the right words to identify data
• Deposit data in the right data repository
• Budget time for data management
• Don’t think of it as YOUR data
3. What’s in it for YOU?
We all benefit from data sharing.
• More citations of YOUR work, increasing your visibility in the research community
• Easily comply with journal and funding requirements
• Less time spent fulfilling requests for data
7. Data re-use leads to new insights
[Workflow figure: 503 datasets and 314 datasets → data processing, quality control, validation → statistical analysis → additional experiments]
Yu Zhang et al. PNAS. doi:10.1073/pnas.1716300115
NOVEL DISCOVERY: MET1 and CMT3 are independently required for the maintenance of asymmetric CHH methylation at CMT2 target sites.
8. Credit: Melissa Haendel
Wilkinson et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. doi:10.1038/sdata.2016.18. https://www.nature.com/articles/sdata201618
• Findable means data is human and machine readable and attached to persistent identifiers
• Accessible means data can be found and retrieved by humans and machines using standard formats
• Interoperable means data can be exchanged and used between systems
• Reusable means data can be used by others
9. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
10. Use Standard Formats: SNP example
A SNP (Single Nucleotide Polymorphism) record needs: a base, a chromosome number and genome position, a reference to the genome assembly used, and the genotypes of the lines tested.
Three ways the same calls appear in the wild:

CHROM   POS     REF   ALT   Line1   Line2
1       12345   A     C     A       A
3       67891   C     T     H       C
10      23456   G     T     T       U

CHROM   POS     REF   ALT   Line1   Line2
Gm01    12345   A     C     0/0     0/0
Gm03    67891   C     T     0/1     0/0
Gm10    23456   G     T     1/1     ./.

CHROM   POS     REF   ALT   Line1   Line2
Chr01   12345   A     C     AA      AA
Chr03   67891   C     T     C/T     CC
Chr10   23456   G     T     TT      NN

ALL MEAN THE SAME! BUT ARE NOT THE SAME.
VCF (Variant Call Format) is the STANDARD.
Use the file format STANDARD for your data type.
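The standard can be made concrete. Below is a minimal sketch of a VCF file carrying the same calls as the tables above, plus a tiny standard-library parser; the sample names Line1/Line2 and the reference assembly name are illustrative.

```python
# Build and parse a minimal VCF (Variant Call Format) record set using only
# the standard library. Positions and genotypes mirror the slide's example;
# sample names "Line1"/"Line2" and the reference name are illustrative.
import csv
import io

VCF_TEXT = """\
##fileformat=VCFv4.2
##reference=example_assembly_v1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tLine1\tLine2
Gm01\t12345\t.\tA\tC\t.\tPASS\t.\tGT\t0/0\t0/0
Gm03\t67891\t.\tC\tT\t.\tPASS\t.\tGT\t0/1\t0/0
Gm10\t23456\t.\tG\tT\t.\tPASS\t.\tGT\t1/1\t./.
"""

def parse_vcf(text):
    """Return a list of dicts, one per variant line, keyed by column name."""
    lines = [l for l in text.splitlines() if not l.startswith("##")]
    header = lines[0].lstrip("#").split("\t")
    reader = csv.DictReader(io.StringIO("\n".join(lines[1:])),
                            fieldnames=header, delimiter="\t")
    return list(reader)

records = parse_vcf(VCF_TEXT)
het = [r for r in records if r["Line1"] == "0/1"]  # heterozygous in Line1
```

Because the genotype encoding (0/0, 0/1, ./.) and the column layout are fixed by the standard, any VCF-aware tool can consume this file; none of the three ad hoc tables above has that property.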
12. Use Standard Formats: Beware of Excel
If you use EXCEL, look out for data corruption and hidden Microsoft characters that impede parsing.
Ziemann et al. (2016) doi:10.1186/s13059-016-1044-7
[Fig. 1: Prevalence of gene name errors in supplementary Excel files — percentage of papers with gene lists affected; increase in supplementary files with gene name errors per year]
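The corruption Ziemann et al. documented (gene symbols such as SEPT2 silently becoming the date "2-Sep") can be screened for after export. A minimal sketch, assuming the gene list can be dumped to plain text:

```python
# Flag gene symbols that Excel has silently converted to dates
# (the error class described by Ziemann et al. 2016). This scans a list
# for values that already look like Excel date conversions, e.g.
# "2-Sep" (was SEPT2) or "1-Mar" (was MARCH1).
import re

# Day-month patterns Excel produces when mangling SEPT*/MARCH*/DEC*-style symbols
EXCEL_DATE_PATTERN = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$")

def find_mangled(gene_list):
    """Return entries that look like Excel date conversions, not gene symbols."""
    return [g for g in gene_list if EXCEL_DATE_PATTERN.match(g)]

genes = ["TP53", "2-Sep", "BRCA1", "1-Mar", "DEC1"]
mangled = find_mangled(genes)  # -> ["2-Sep", "1-Mar"]
```

The real fix is upstream: import gene columns as text, or avoid Excel for this data type entirely; a check like this only catches the damage after the fact.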
13. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
14. Supply Complete Metadata
Metadata:
• Species = xxx
• Germplasm = xxx
• Field location = xxx
• Environment = xxx
• Measurement method = xxx
Phenotype (data): plant is 170 cm tall
Metadata is data about the data, and allows understanding of the data.
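The metadata/data split above can be written down as a machine-readable record. A sketch serialized as JSON; the field names and example values below are illustrative, not a formal standard:

```python
# A phenotype observation with its metadata attached, serialized as JSON.
# Field names and values are illustrative placeholders, not a formal schema.
import json

record = {
    "metadata": {
        "species": "Zea mays",                       # illustrative
        "germplasm": "B73",                          # illustrative
        "field_location": "Davis, CA",               # illustrative
        "environment": "irrigated field, summer",    # illustrative
        "measurement_method": "ruler, soil surface to flag leaf tip",
    },
    "data": {"trait": "plant height", "value": 170, "unit": "cm"},
}

serialized = json.dumps(record, indent=2)  # human- and machine-readable
restored = json.loads(serialized)          # round-trips without loss
```

The point is that "170" alone is useless; paired with species, germplasm, environment and method, the same number becomes reusable by someone who never saw your field.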
15. Supply Complete Metadata
• Write your Materials and Methods as if you wanted someone else to be able to reproduce your work.
• Be accurate and complete about your bench and field work; include samples/stocks/lines used, accession numbers, sources of materials, exact measuring techniques, etc.
• Be AS accurate and complete about your computational pipelines. Include your created raw data files and versions. If you use reference data (e.g., a sequence assembly), include the version number, download dates, and download source.
• Include names of software applications, versions, platforms and sources. If you use a platform such as CyVerse, use its metadata reporting tools.
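Recording software names and versions can be automated rather than reconstructed from memory at writing time. A minimal sketch using only the standard library; pass in whatever packages your pipeline actually imports:

```python
# Capture the computational environment (interpreter, platform, package
# versions) as a JSON report you can archive alongside your results and
# quote verbatim in Materials and Methods.
import json
import platform
from importlib import metadata

def environment_report(packages=()):
    """Return a dict describing Python, the OS, and installed package versions."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {},
    }
    for name in packages:
        try:
            report["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report["packages"][name] = "not installed"
    return report

# Example: list the packages your own pipeline uses here.
print(json.dumps(environment_report(["pip"]), indent=2))
```

Run once at the end of an analysis and commit the output next to the data; "download dates and sources" for reference data still have to be recorded by hand.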
19. Supply Complete Metadata
• 597 possible attributes
• At least 50 attributes
• Genome sequence assembly: at least 100 attributes
20. Budget TIME to provide metadata
“The metadata in public databases is often confusing; a test case with Zea mays mRNA seq data reveals a high proportion of missing, misleading or incomplete metadata.” (2018) https://doi.org/10.1016/j.plantsci.2017.10.014
22. Metadata Standards for Various Data Types
• Established: Genomic Standards Consortium (http://gensc.org) — Minimum Information about any (x) Sequence (MIxS)
• Emerging: Minimum Information About a Plant Phenotyping Experiment (MIAPPE)
Supply complete metadata. Ask for help from database people.
23. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
25. Embrace Ontologies
An ontology is: a set of precisely defined terms, in a logical hierarchy, where the relationships between terms can be understood by computers.
26. Ontologies: hierarchy of terms and explicit relationships among terms
Plant Ontology (PO) example, from general to specific:
• Leaf (PO:0025034)
• Vascular leaf (PO:0009025)
• Adult vascular leaf (PO:0020103)
• Flag leaf (PO:0020103)
• Leaf sheath (PO:0020104)
• Ligule (PO:0020105)
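The hierarchy above is what makes ontology terms machine-usable: once the is_a relationships are explicit, a computer can answer "is a flag leaf a kind of leaf?" without human help. A simplified sketch; the parent links below approximate the slide's tree and are NOT the full Plant Ontology graph:

```python
# Encode an ontology-style hierarchy as parent links and answer subsumption
# queries by walking up the chain. A simplification of the real Plant
# Ontology: one parent per term, labels instead of PO IDs.

PARENT = {
    "vascular leaf": "leaf",
    "leaf sheath": "vascular leaf",
    "flag leaf": "vascular leaf",
    "adult vascular leaf": "vascular leaf",
    "ligule": "leaf sheath",
}

def ancestors(term):
    """Return the chain of is_a ancestors for a term, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

def is_a(term, ancestor):
    """True if `term` is (transitively) a kind of `ancestor`."""
    return ancestor in ancestors(term)

# A dataset annotated with "ligule" is automatically found by a query for "leaf".
assert is_a("ligule", "leaf")
```

This is exactly why annotating data with ontology terms beats free text: a query for the general term retrieves everything annotated with its descendants, across papers and species.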
27. Data from diverse types of experiments and organisms can be compared
• Franssen, H.J., et al. (2015) (Medicago) doi:10.1242/dev.120774
• Li, S., et al. (2016) (Arabidopsis) doi:10.1016/j.devcel.2016.10.012
• Zhou, X.-F., et al. (2014) doi:10.1104/pp.114.243808
28. Embracing Ontologies
• Ontologies provide a POWERFUL, MACHINE-READABLE utility for data
• Find and use existing ontologies (http://www.obofoundry.org/, Planteome):
• Gene function = Gene Ontology (GO)
• Sequences = Sequence Ontology (SO)
• Plant anatomy and development = Plant Ontology (PO)
• Phenotypes = Phenotype And Trait Ontology (PATO)
• …and many, many others
• Apply them consistently: to datasets (e.g., in metadata) and in publications (e.g., TAIR GO/PO submission)
• Ask questions!
29. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
35. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
36. Problem: Data is not findable because it is not available
Piwowar, H.A., Vision, T.J. (2013) Data reuse and the open data citation advantage. PeerJ 1:e175. https://doi.org/10.7717/peerj.175
Gibney, E. and Van Noorden, R. doi:10.1038/nature.2013.14416
37. Put your data in a stable public repository
• Large international repositories for many data types, for all species (ALL sequence data goes here)
• Large but specialized databases serving many species
• Specialized databases serving specific communities (e.g., SoyBase)
38. Submitting to a repository: SNP example
As of 9/2017, all NON-human SNPs are processed through EMBL in the European Variation Archive (EVA, https://www.ebi.ac.uk/eva/). NCBI’s dbSNP will only process human SNPs.
EVA will require:
• Data in (standard) Variant Call Format (VCF), including allele frequencies
• A SUBMITTED genome or transcriptome assembly
39. What if there is no specialized database? Or no recommendations from journals?
You should get a Digital Object Identifier (DOI):
• http://datadryad.org (** curated, metadata)
• https://zenodo.org/
• https://figshare.com/
And just for you folks at UC:
• https://datashare.ucsf.edu/stash
40. But… please, don’t forget to actually complete your submission*
*And you never have to spend time fielding requests or transferring huge data files again.
42. How to Make Your Published Data FAIR
• Use standard formats
• Supply complete and deep metadata
• Embrace Ontologies
• Use persistent and unambiguous identifiers
• Put your data in a long term stable repository
• Cite, share freely and encourage others
43. Cite, share freely and encourage others to be FAIR
• Include searchable and citable identifiers for your data in your papers.
• Release your data with clearly defined terms of use, e.g., Creative Commons (CC): CC0, CC-BY. If you do not specify terms, implied restrictions may limit reuse.
• Cite all of your data sources. This enhances reproducibility… and also shows value to funders!
• When reviewing papers, check them for FAIRness.
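A citable, clearly licensed dataset can carry that information in machine-readable form. A sketch whose field names loosely follow the DataCite metadata schema; the DOI, title, and names below are placeholders, not a real record:

```python
# Machine-readable citation info for a released dataset: a persistent
# identifier plus an explicit license, so both humans and machines know
# how to cite and reuse it. All values are illustrative placeholders;
# field names loosely follow the DataCite schema.
import json

dataset_citation = {
    "identifier": {"type": "DOI", "value": "10.0000/example.dataset"},  # placeholder
    "title": "SNP genotypes for example recombinant inbred lines",      # placeholder
    "creators": ["Doe, J.", "Roe, R."],                                 # placeholders
    "publisher": "Dryad",
    "publicationYear": 2018,
    "license": "CC0-1.0",
}

print(json.dumps(dataset_citation, indent=2))
```

Repositories like Dryad and Zenodo generate a record along these lines for you at deposit time; the point of the sketch is that an explicit license field is what turns "shared" into "reusable".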
45. A few simple things to remember when preparing your paper
• Include unambiguous identifiers
• Format data according to defined standards
• Keep data in (parseable) tables or text
• Include meaningful metadata
• Deposit data in a long-term stable public repository and get a DOI
• It is never too early to think about (meta)data; the best time to start is BEFORE you are writing
46. You can get help structuring, organizing and managing your data
• Contact your community database
• Don’t have one? Contact a curator (Leonore, Lisa… we live amongst you)
• UCB Research Data Management librarians (http://researchdata.berkeley.edu/)
48. What YOU can do right now to support FAIR data
• Ask your funders for increased access to FAIR data.
• When you review papers, look at the data and be sure it is well described (metadata is great).
• Change your attitude a little: your data will be more cited and more important if you make it FAIR.
• Deposit your data and get a DOI.
• Ask your institution to value good data submission and good data recycling.