This webinar discusses workflow tools to support life science research. It includes presentations on the Common Workflow Language (CWL) by Michael Crusoe and on uses of KNIME and Pipeline Pilot workflows with Open PHACTS examples. There will also be a panel discussion on the future of workflows for life science research, with speakers from Eli Lilly, Janssen, and others. Example CWL workflows are shown to demonstrate portable life science workflows.
Scott Edmunds slides for class 8 from the HKU Data Curation (module MLIM7350 from the Faculty of Education) course covering science data, medical data and ethics, and the FAIR data principles.
Embedded with the Scientists: The UCLA Experience (lmfederer)
My slides for the Fall 2012 Professional Development Day for New England Librarians, November 7, 2012 (for more information, see http://libraryguides.umassmed.edu/Informationists)
This document provides an overview of TOPSAN (The Open Protein Structure Annotation Network). TOPSAN is a database that provides extensive annotations for nearly 10,000 protein structures solved by structural genomics centers. It combines automated and human-edited annotations and characterizes single proteins, protein families, and entire genomes. The document explains that TOPSAN uses semantic web principles to connect curated and automatic annotations to other analysis tools and databases, allowing the TOPSAN content to be searched, exported, and analyzed as part of the larger web of biological data.
The Seven Deadly Sins of Bioinformatics (Duncan Hull)
Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
Being Reproducible: SSBSS Summer School 2017 (Carole Goble)
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transfer between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
Being FAIR: FAIR data and model management, SSBSS 2017 Summer School (Carole Goble)
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face ranging across European programmes (SysMO and ERASysAPP ERANets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE) as well as in PI's labs and Centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements to coordinate and sustain key data repositories and archives for the Life Science community, improve access to them and related tools, support training, and create a platform for dataset interoperability. As Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform, I will show how this work relates to your projects.
[1] Wilkinson et al., "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data 3 (2016), doi:10.1038/sdata.2016.18
For a Bioinformatics Discussion for Students and Post-Docs (BioDSP) meeting: Expands on Sandve's "Ten Simple Rules for Reproducible Computational Research"
Scott Edmunds: GigaScience - a journal or a database? Lessons learned from th... (GigaScience, BGI Hong Kong)
Scott Edmunds talk at the HUPO congress in Geneva, September 6th 2011 on GigaScience - a journal or a database? Lessons learned from the Genomics Tsunami.
Docker in Open Science Data Analysis Challenges by Bruce Hoff (Docker, Inc.)
Typically in predictive data analysis challenges, participants are provided a dataset and asked to make predictions. Participants include with their prediction the scripts/code used to produce it. Challenge administrators validate the winning model by reconstructing and running the source code.
Often data cannot be provided to participants directly, e.g. due to data sensitivity (data may be from living human subjects) or data size (tens of terabytes). Further, predictions must be reproducible from the code provided by participants. Containerization is an excellent solution to these problems: rather than providing the data to the participants, we ask the participants to provide a Dockerized "trainable" model. We run both the training and validation phases of machine learning and guarantee reproducibility 'for free'.
We use the Docker tool suite to spin up and run servers in the cloud to process the queue of submitted containers, each essentially a batch job. This fleet can be scaled to match the level of activity in the challenge. We have used Docker successfully in our 2015 ALS Stratification Challenge and our 2015 Somatic Mutation Calling Tumour Heterogeneity (SMC-HET) Challenge, and are starting an implementation for our 2016 Digital Mammography Challenge.
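The container-per-submission pattern described above can be sketched in a few lines. This is a hedged illustration, not the challenge infrastructure's actual interface: the image tag, mount points, and phase arguments are all hypothetical.

```python
"""Sketch: run a submitted Dockerized model in separate train and predict
phases, with the sensitive challenge data mounted read-only so it never
leaves the server. All names here are illustrative assumptions."""
import subprocess


def docker_cmd(image: str, phase: str, data_dir: str, state_dir: str) -> list[str]:
    # Build the docker invocation: the container sees only the two mounts,
    # and reproducibility comes from the pinned image itself.
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/data:ro",   # challenge data, read-only
        "-v", f"{state_dir}:/state",    # model weights and predictions
        image, phase,
    ]


def run_phase(image: str, phase: str, data_dir: str, state_dir: str) -> None:
    # Each queued submission is essentially a batch job of two such runs.
    subprocess.run(docker_cmd(image, phase, data_dir, state_dir), check=True)


# Hypothetical usage, one submission end to end:
# run_phase("submission-abc123", "train", "/secure/train", "/scratch/abc123")
# run_phase("submission-abc123", "predict", "/secure/test", "/scratch/abc123")
```

Because the validation run uses the same frozen image as the training run, re-running the winning model needs no reconstruction of the participant's environment.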
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article
Carole Goble ,
Better Software, Better Research
Issue No.05 - Sept.-Oct. (2014 vol.18)
pp: 4-8
IEEE Computer Society
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
This workshop aims to gather practitioners of all levels and from a variety of research areas (agronomy, plant biology, food, life sciences, etc.) to compare best practices, points of view and projects about producing and consuming data in the agrifood field.
As happens in general for digital data, current trends in this arena include the integration of "traditional" semantics-based approaches (e.g., ontologies, RDF-based linked data) with lightweight schemas (e.g., Bioschemas/schema.org), the use of JSON-based APIs, the development of data lakes and knowledge graphs based on NoSQL technologies, and graph databases based on property graphs (e.g., Neo4j, TinkerPop/Gremlin).
Workshop participants will get an opportunity to discuss how these approaches and technologies are being used in the agrifood field, for the purpose of realising the FAIR data principles and making data sharing a powerful tool for research, industry and socio-economic investigation. In particular, we will propose an interactive session to outline how participant-proposed datasets can be encoded through Bioschemas or similar approaches.
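As a taste of the lightweight-schema approach used in the interactive session, a minimal schema.org Dataset record (the base that Bioschemas profiles build on) can be emitted as JSON-LD in a few lines. The dataset name and URLs below are hypothetical, and a real Bioschemas profile mandates further properties.

```python
"""Minimal schema.org/Bioschemas-style Dataset record as JSON-LD.
The example values are placeholders, not a real dataset."""
import json


def dataset_jsonld(name: str, description: str, url: str, license_url: str) -> dict:
    # Only the core Dataset properties are shown; Bioschemas profiles
    # recommend or require additional ones (identifier, keywords, ...).
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
    }


record = dataset_jsonld(
    "Wheat phenotyping trial 2019",                     # hypothetical dataset
    "Field measurements for 200 wheat cultivars.",
    "https://example.org/datasets/wheat-2019",          # placeholder URL
    "https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(record, indent=2))
```

Embedding such a record in a dataset's landing page (e.g. in a `<script type="application/ld+json">` tag) is what makes it harvestable by generic crawlers, which is the "Findable" part of FAIR in this approach.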
This presentation was provided by Violeta Ilik of Northwestern University during the NISO Virtual Conference held on Feb 15, 2017, entitled Institutional Repositories: Ensuring Yours is Populated, Useful and Thriving. The DOI for this presentation is http://dx.doi.org/10.18131/G3VP6R
The document discusses how universities can maximize research output through open access repositories and metrics. It argues that by mandating that researchers deposit their work in institutional repositories, universities can provide open access to 100% of research articles. This maximizes the visibility, usage, and impact of the research and provides competitive advantages for universities that adopt open access mandates early on. Open access is achieved through "green open access self-archiving," where authors deposit their final, peer-reviewed manuscripts in institutional repositories.
Results may vary: Collaborations Workshop, Oxford 2014 (Carole Goble)
Thoughts on computational science reproducibility with a focus on software. Given at the Software Sustainability Institute's 2014 Collaborations Workshop
12.10.14 Slides, “Roadmap to the Future of SHARE” (DuraSpace)
Hot Topics: The DuraSpace Community Webinar Series
Series 10: All About the SHared Access Research Ecosystem (SHARE)
Webinar 3: Roadmap to the Future of SHARE
Wednesday, January 14, 2015
Presented by Judy Ruttenberg, Program Director, Association of Research Libraries
This document introduces FAIRDOM, a consortium that provides a platform and services to help researchers organize, manage, share, and preserve research outputs according to FAIR principles. FAIRDOM has been in operation for 10 years and has over 50 installations supporting over 118 projects. It provides tools and services to help researchers collaborate better and integrate their data, models, publications and other research objects. FAIRDOM also works with other organizations and infrastructure providers to support broader research initiatives.
The document describes the first phase of developing the OnScience portal, which involved designing the architecture and schematics. Key points:
- The team split into groups based on skills to work on different phases. Phase 1 focused on architecture.
- Modules like a researcher rating system were planned to make the portal more useful than existing sites. The rating system considered factors like publications.
- Developing a robust e-commerce platform was a challenge to balance user and business interests.
- A dummy platform tested the rating system algorithm by having users create profiles before the public launch.
- The main page layout was designed using interface tools to optimize the user experience. PHP and JavaScript were selected for the technical
This presentation was provided by Kristi Holmes of Northwestern University during the NISO hot topic virtual conference "Effective Data Management," which was held on September 29, 2021.
The document summarizes the collaboration between research libraries and computational research. It discusses how libraries traditionally provided curation, preservation, and sharing functions but now face challenges in continuing these roles with large computational analyses. The libraries must collaborate with research computing to address issues like data preservation requirements conflicting with computational resource needs. Recent projects between Hesburgh Libraries and research computing are highlighted as successful examples of such collaboration, including initiatives to develop tools for reproducible computational research and preservation of executable software and datasets.
Jean-Claude Bradley presents on "Peer Review and Science2.0: blogs, wikis and social networking sites" as a guest lecturer for the “Peer Review Culture in Scholarly Publication and Grantmaking” course at Drexel University. The main thrust of the presentation is that peer review alone is not capable of coping with the increasing flood of scientific information being generated and shared. Arguments are made to show that providing sufficient proof for scientific findings does scale, and that it weakens the tragedy of the trusted-source cascade.
Open Knowledge and University of Cambridge European Bioinformatics Institute (TheContentMine)
This document discusses open data and open science. It highlights Jean-Claude Bradley as a pioneer of open notebook science and open data who believed closed data means people die. It describes tools like ContentMine that can automatically extract data like chemical reactions, phylogenetic trees and clinical trial results from papers. Visitors can extract specific types of data while repositories can solve problems communally with continuous publication and validation.
Metadata and Semantics Research Conference, Manchester, UK 2015
Research Objects: why, what and how
In practice the exchange, reuse and reproduction of scientific experiments is hard, dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: codes fork, data is updated, algorithms are revised, workflows break, service updates are released. Neither should they be viewed just as second-class artifacts tethered to publications, but as the focus of research outcomes in their own right: articles clustered around datasets, methods with citation profiles. Many funders and publishers have come to acknowledge this, moving to data sharing policies and provisioning e-infrastructure platforms. Many researchers recognise the importance of working with Research Objects, and the term has become widespread. However: what is a Research Object? How do you mint one, exchange one, build a platform to support one, curate one? How do we introduce them in a lightweight way that platform developers can migrate to? What is the practical impact of a Research Object Commons on training, stewardship, scholarship and sharing? How do we address the scholarly and technological debt of making and maintaining Research Objects? Are there any examples?
I’ll present our practical experiences of the why, what and how of Research Objects.
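To make the bundling idea concrete, here is a toy sketch of packing files plus a JSON manifest into a single archive. This is an illustration of the concept only, not the Research Object specification's actual manifest layout or vocabulary.

```python
"""Toy 'research object' bundler: data/code files plus a JSON manifest
in one zip. The manifest structure is a made-up illustration."""
import json
import pathlib
import zipfile


def bundle_research_object(files: list[str], metadata: dict, out_path: str) -> None:
    # The manifest travels with the artifacts, so the bundle carries
    # its own description of what it contains and who made it.
    with zipfile.ZipFile(out_path, "w") as zf:
        zf.writestr("manifest.json", json.dumps(metadata, indent=2))
        for f in files:
            # Store each file flat under its base name.
            zf.write(f, arcname=pathlib.Path(f).name)
```

A real Research Object additionally records relationships and provenance (which workflow produced which dataset, under which version), which is exactly the part that a flat zip cannot express and a manifest vocabulary must.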
Elsevier's RDM Program: Habits of Effective Data and the Bourne Ultimatum (Anita de Waard)
Elsevier's RDM Program: Ten Habits of Highly Effective Data
The document outlines Elsevier's research data management (RDM) program and efforts to support the effective management of research data. It discusses a "Maslow hierarchy" with 10 aspects of highly effective research data from stored to integrated. It provides examples of Elsevier's RDM tools and services like Hivebench, Mendeley Data, and DataSearch that help support storing, sharing, citing, and discovering research data. It also discusses collaborative RDM efforts like Force11, Research Data Alliance, and Crossref as well as journal initiatives to improve reproducibility. The document concludes with a proposed partnership where an institution could pilot and provide feedback on Elsevier's
Open PHACTS provides a single access point for integrating multiple biomedical data resources. It has transitioned from an EU project to the Open PHACTS Foundation to sustain the platform long-term. Challenges included addressing licensing issues across different data sources and enabling maximum dissemination. Usage has grown to over 500 million queries. The Foundation is pursuing collaboration, grants, and industry partnerships to support ongoing development and new projects. It welcomes contributions to improve services and develop new data and workflows.
Open PHACTS Webinar: Computational Protocols for In Silico Target Validationopen_phacts
Watch the full webinar on YouTube at https://youtu.be/Wc7ynRyojM4
The second in our monthly webinar series, covering the latest updates to the Open PHACTS Discovery Platform, and how they can benefit you and your research.
This month Edgar Jacoby (Janssen) discusses computational protocols for in silico target validation, and "knowing the knowns" in phenotypic screening.
2015-05-19 Open PHACTS Drug Discovery Workflow Workshop - KNIMEopen_phacts
An explanation of the Open PHACTS API, and how you can use it to help with your drug discovery workflows. Presented by Daniela Digles at the Open PHACTS Drug Discovery Workflow Workshop: http://www.openphactsfoundation.org/open-phacts-pipeline-pilot-knime-workshop/
2015-05-19 Open PHACTS Drug Discovery Workflow Workshop - The APIopen_phacts
1. The document describes the Open PHACTS API workflow for querying biological and chemical data.
2. It provides an overview of the API including documentation, entry points, response templates, and concept types that can be queried.
3. Examples are given of API calls to retrieve information on compounds, targets, tissues, diseases, and pathways from various data sources.
2015-02-10 The Open PHACTS Discovery Platform: Semantic Data Integration for ...open_phacts
The Open PHACTS Discovery Platform integrates multiple biomedical data resources into a single open access point using semantic web technology. It is guided by business questions from pharmaceutical companies to integrate data from sources like ChEMBL, DrugBank, UniProt, and more. The platform is run as a public-private partnership through 2021 to support drug discovery.
2014-03-20 Open PHACTS - A Data Platform for Drug Discoveryopen_phacts
A data platform is proposed for drug discovery that would lower industry firewalls and enable pre-competitive data integration, analysis, and reuse across pharmaceutical companies. The platform would integrate external research data from literature, databases, and other sources on compounds, targets, pathways, and diseases. It would provide data integration and analysis tools through a firewalled database system and applications. The goal is to advance drug discovery by allowing multiple companies to access and build upon the same large foundation of pre-competitive research data.
The document discusses the Open PHACTS platform, which aims to reduce barriers to drug discovery by integrating pharmacological data from multiple sources into a single API. The platform uses semantic technologies to flexibly integrate datasets and allow adaptive querying. It provides tools and services to support pharmacological research for industry, academia, and small businesses.
2013 Open PHACTS Scientific Questions Posteropen_phacts
This document discusses scientific competency questions that were collected by the Open PHACTS consortium to guide the development of the Open PHACTS integrated pharmacological data platform. 83 questions were provided by consortium members and prioritized, with the top 20 questions clustered into two groups related to compound-target and compound-target-disease/pathway interactions. Analyzing the questions revealed that compound, target, pathway and disease data needs to be associated to answer them. This informed the selection of public databases and drove the requirements for linking data sources in Open PHACTS.
This document introduces several exemplar applications that were developed to showcase the capabilities of the Open PHACTS platform API. The exemplars include ChemBio Navigator, which allows browsing chemical and biological data for drug discovery applications; Polypharmacology Browser tools like GARField and PharmaTrek that enable exploration of compound-target interactions; and the Target Dossier, which compiles target-related information for decision support in target selection and validation. These exemplars demonstrate how diverse data integrated through the Open PHACTS platform can address relevant problems in drug development and biomedical research.
Presented by Richard Kidd at "The Future Information Needs of Pharmaceutical & Medicinal Chemistry", Monday 28 November 2011 at The Linnean Society, Burlington Square, London run by the RSC CICAG group.
2011-10-11 Open PHACTS at BioIT World Europeopen_phacts
The document discusses the Innovative Medicines Initiative's Open PHACTS project, which aims to develop robust standards and apply them in a semantic integration platform ("Open Pharmacological Space") to integrate drug discovery data from various public and private sources. The project brings together partners from industry, academia, and non-profits to build an open infrastructure for linking drug discovery knowledge and supporting ongoing research. It outlines the technical approach, priorities, and initial progress on developing exemplar applications and a prototype "lash up" system.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Enhanced data collection methods can help uncover the true extent of child abuse and neglect. This includes Integrated Data Systems from various sources (e.g., schools, healthcare providers, social services) to identify patterns and potential cases of abuse and neglect.
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Generative Classifiers: Classifying with Bayesian decision theory, Bayes’ rule, Naïve Bayes classifier.
Discriminative Classifiers: Logistic Regression, Decision Trees: Training and Visualizing a Decision Tree, Making Predictions, Estimating Class Probabilities, The CART Training Algorithm, Attribute selection measures- Gini impurity; Entropy, Regularization Hyperparameters, Regression Trees, Linear Support vector machines.
Open PHACTS April 2017 Science webinar Workflow tools
1. Workflow tools for Life Science
Research
Apr 2017
nick@openphactsfoundation.org
2. This webinar is being
recorded and will be uploaded
to Slideshare etc afterwards
@Open_PHACTS
LinkedIn Group
RSS & Newsletter
3. Agenda
Introduction to common workflow language (CWL) -
Michael Crusoe
Accessing Open PHACTS with Knime nodes to support
Life Science Business questions - James Lumley, Eli
Lilly & Company
Pipeline Pilot workflows with Open PHACTS Examples
Jean-Marc Neefs, Janssen
Panel discussion on where next with Workflow and
supporting Life Science research
4. Our speakers & panel
Michael Crusoe, Common Workflow Language co-founder
James Lumley, Informatics, Eli Lilly & Company
Jean-Marc Neefs, Janssen
Panel:
– Michael Crusoe, James Lumley, Jean-Marc Neefs
– Derek Marren, Eli Lilly
– Daniela Digles, University of Vienna
– Andrei Caracoti, Biovia
5. Workflow Examples
The Application of the Open Pharmacological Concepts Triple Store (Open
PHACTS) to Support Drug Discovery Research
PLoS ONE 2014 DOI: 10.1371/journal.pone.0115460
Drug discovery FAQs: workflows for answering multidomain drug discovery
questions
Drug Discovery Today 2015 DOI: 10.1016/j.drudis.2014.11.006
Open PHACTS computational protocols for in silico target validation of
cellular phenotypic screens: knowing the knowns
Med. Chem. Commun. 2016 DOI: 10.1039/c6md00065g
Selectivity profiling of BCRP versus P-gp inhibition: from automated
collection of polypharmacology data to multi-label learning
J Cheminform 2016 DOI: 10.1186/s13321-016-0121-y
7. https://goo.gl/Aujxza
Why use a workflow management system?
Features can include:
● separation of concerns: focus on the science being
done first; then optimize execution later
● automatic job execution: start a complicated
analysis involving many pieces with a single command
● scaling (across nodes, clusters, and possibly
continents)
● automatically generated graphical user interfaces
(example: Galaxy)
● How was this file made? (automatic provenance
tracking)
12. https://goo.gl/Aujxza
Why have a standard?
● Standards create a surface for collaboration that
promote innovation
● Researchers frequently dip in and out of different
systems, but interoperability is not a basic
feature.
● Funders, journals, and other sources of
incentives prefer standards over proprietary or
single-source approaches
13. https://goo.gl/Aujxza
Common Workflow Language v1.0
● Common format for bioinformatics (and more!) tool
& workflow execution
● Community based standards effort, not a specific
software package; Very extensible
● Defined with a schema, specification, & test
suite
● Designed for shared-nothing clusters, academic
clusters, cloud environments, and local execution
● Supports the use of containers (e.g. Docker) and
shared research computing clusters with locally
installed software
15. https://goo.gl/Aujxza
Why use the Common Workflow Language?
Develop your pipeline on your local computer
(optionally with containers)
Execute on your research cluster or in the cloud
Deliver to users via workbenches like Arvados, Rabix,
and Toil. Galaxy, Apache Taverna, AWE, and Funnel (GCP)
support is at the alpha stage.
16. https://goo.gl/Aujxza
● Low barrier to entry for implementers
● Support tooling such as generators, GUIs, converters
● Allow extensions, but must be well marked
● Be part of linked data ecosystem
● Be pragmatic
CWL Design principles
17. https://goo.gl/Aujxza
Linked Data & CWL
● Hyperlinks are common currency
● Bring your own RDF ontologies for metadata
● Supports SPARQL to query
Example: can use the EDAM ontology (ELIXIR-DK) to
specify file formats and reason about them:
“FASTQ Sanger” encoding is a type of FASTQ file
18. https://goo.gl/Aujxza
Use Cases for the CWL standards
Publication reproducibility, reusability
Workflow creation & improvement across institutions
and continents
Contests & challenges
Analysis on non-public data sets, possibly using GA4GH
job & workflow submission API
19. https://goo.gl/Aujxza
Early Adopters
(US) National Cancer Institute Cloud Pilots (Seven
Bridges Genomics, Institute for Systems Biology)
Cincinnati Children’s Hospital Medical Research Center
(Andrey Kartashov & Artem Barski)
bcbio: Validated, scalable, community developed
variant calling, RNA-seq and small RNA analysis (docs,
BOSC 2016 talk: video, slides) (Brad Chapman et al.)
Duke University, Center for Genomic and Computational
Biology: GENOMICS OF GENE REGULATION project (BOSC
2016 talk: video, slides, poster)(Dan Leehr et al.)
NCI DREAM SMC-RNA Challenge (Kyle Ellrott et al.)
Presentation
20. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Sample Real World CWL Workflow
Courtesy US NIH NCI Genomic Data Commons, visualization from
https://view.commonwl.org/workflows/github.com/NCI-GDC/gdc-dnaseq-cwl/tree/master/workflows/dnaseq/transform.cwl
21. https://goo.gl/Aujxza
Announcing: v1.0!
http://www.commonwl.org/v1.0/
Authors:
Peter Amstutz, Arvados Project, Curoverse
Michael R. Crusoe, Common Workflow Language project
Nebojša Tijanić, Seven Bridges Genomics
Contributors:
Brad Chapman, Harvard Chan School of Public Health
John Chilton, Galaxy Project, Pennsylvania State University
Michael Heuer, UC Berkeley AMPLab
Andrey Kartashov, Cincinnati Children's Hospital
Dan Leehr, Duke University
Hervé Ménager, Institut Pasteur
Maya Nedeljkovich, Seven Bridges Genomics
Matt Scales, Institute of Cancer Research, London
Stian Soiland-Reyes, University of Manchester
Luka Stojanovic, Seven Bridges Genomics
22. https://goo.gl/Aujxza
How did we do it?
Initial group started at BOSC Codefest 2014
Moved to open mailing list and extended onto GitHub &
then Gitter chat
Frequent (twice a month or more) video chats to work
through design issues with summaries emailed
Some participants doing CWL community work during
their day jobs, some on “nights & weekends”.
In October 2015 Seven Bridges sponsored one of the
co-founders (M. Crusoe) to work full time on the
project
23. https://goo.gl/Aujxza
Community Based Standards development
Different model than traditional nation-based or
regulatory approach
We adopted the Open-Stand.org Modern Paradigm for
Standards: Cooperation, Adherence to Principles (Due
process, Broad consensus, Transparency, Balance,
Openness), Collective Empowerment, (Free)
Availability, Voluntary Adoption
24. https://goo.gl/Aujxza
Challenges
Giving a standard to a community that is “free as in
puppies”: How does the community participate? How will
maintenance be funded?
CWL isn’t the only effort that has these needs; can we
join with related efforts?
25. https://goo.gl/Aujxza
A Grand Opportunity
if:
properly funded and embraced by the wider community
then:
the researchobject.org standards + CWL could fulfill
the huge need for an executable and complete
description of how computationally derived research
results were made
26. https://goo.gl/Aujxza
What’s next for the Common Workflow
Language?
Public charity to own the standard
Tooling improvements
More implementations (Galaxy, Taverna, Kepler, Xenon,
…?)
Integration with researchobject.org standards for
attribution, provenance, and metadata guidance.
28. https://goo.gl/Aujxza
Michael R. Crusoe, who is this guy?
Phoenix, Arizona (Sonoran Desert), USA
Studied at Arizona State University: Computer Science;
time in industry as a developer & system administrator
(Google, others); returned to academia to study
Microbiology.
Introduced to bioinformatics via Anolis (lizard)
genome assembly and analysis (Kenro Kusumi, Arizona
State University)
Returned to software engineering as a Research
Software Engineer for the khmer project (C. Titus Brown,
Michigan State University, then U. of California,
Davis)
29. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Example: samtools-sort.cwl

class: CommandLineTool                 # file type & metadata
cwlVersion: v1.0
doc: Sort by chromosomal coordinates
inputs:                                # input parameters
  aligned_sequences:
    type: File
    format: edam:format_2572           # BAM binary alignment format
    inputBinding:
      position: 1
outputs:                               # output parameters
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572
baseCommand: [samtools, sort]          # executable
hints:                                 # runtime environment
  DockerRequirement:
    dockerPull: quay.io/cancercollaboratory/dockstore-tool-samtools-sort
$namespaces: { edam: "http://edamontology.org/" }     # linked data support
$schemas: [ "http://edamontology.org/EDAM_1.15.owl" ]
30. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
File type & metadata
class: CommandLineTool
cwlVersion: v1.0
doc: Sort by chromosomal coordinates
● Identify as a CommandLineTool object
● Core spec includes simple comments
● Metadata about tool extensible to arbitrary RDF
vocabularies, e.g.
○ Biotools & EDAM
○ Dublin Core Terms (DCT)
○ Description of a Project (DOAP)
● GA4GH Tool Registry project will develop best
practices for metadata & attribution
31. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
hints:
  DockerRequirement:
    dockerPull: quay.io/[...]samtools-sort
Runtime Environment
● Define the execution environment of the tool
● “requirements” must be fulfilled or an error
● “hints” are soft requirements (express preference
but not an error if not satisfied)
● Also used to enable optional CWL features
○ Mechanism for defining extensions
32. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Input parameters
● Specify name & type of input parameters
○ Based on the Apache Avro type system
○ null, boolean, int, string, float, array, record
○ File formats can be IANA Media/MIME types, or from domain
specific ontologies, like EDAM for bioinformatics
● “inputBinding”: describes how to turn parameter
value into actual command line argument
inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572           # BAM binary format
    inputBinding:
      position: 1
33. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Example: samtools-sort.cwl

class: CommandLineTool                 # file type & metadata
cwlVersion: v1.0
doc: Sort by chromosomal coordinates
inputs:                                # input parameters
  aligned_sequences:
    type: File
    format: edam:format_2572           # BAM binary alignment format
    inputBinding:
      position: 1
outputs:                               # output parameters
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572
baseCommand: [samtools, sort]          # executable
hints:                                 # runtime environment
  DockerRequirement:
    dockerPull: quay.io/cancercollaboratory/dockstore-tool-samtools-sort
$namespaces: { edam: "http://edamontology.org/" }     # linked data support
$schemas: [ "http://edamontology.org/EDAM_1.15.owl" ]
34. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Tool description:

inputs:
  aligned_sequences:
    type: File
    format: edam:format_2572
    inputBinding:
      position: 1
baseCommand: [samtools, sort]

Input object:

aligned_sequences:
  class: File
  location: example.bam
  format: http://edamontology.org/format_2572

Resulting command line:

["samtools", "sort", "example.bam"]

Command Line Building
● Associate input values with parameters
● Apply input bindings to generate strings
● Sort by “position”
● Prefix “base command”
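The four steps above can be sketched in plain Python. This is a hypothetical simplification of what a CWL runner does when assembling a command line, not the reference implementation; the function name and data shapes are illustrative only.

```python
# Sketch of CWL command-line building: associate input values with
# parameters, apply inputBindings, sort by "position", and prefix the
# baseCommand. A simplified illustration, not a CWL runner.

def build_command_line(base_command, inputs, job):
    """base_command: list of strings.
    inputs: {param_name: {"position": int}} (the inputBinding).
    job: {param_name: value}; File values look like
         {"class": "File", "location": "..."}."""
    bound = []
    for name, binding in inputs.items():
        value = job[name]
        # A File parameter contributes its location; others their string form
        if isinstance(value, dict) and value.get("class") == "File":
            arg = value["location"]
        else:
            arg = str(value)
        bound.append((binding["position"], arg))
    bound.sort(key=lambda pair: pair[0])        # sort by "position"
    return list(base_command) + [arg for _, arg in bound]

cmd = build_command_line(
    ["samtools", "sort"],
    {"aligned_sequences": {"position": 1}},
    {"aligned_sequences": {"class": "File", "location": "example.bam"}},
)
print(cmd)  # ['samtools', 'sort', 'example.bam']
```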
35. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
outputs:
  sorted_aligned_sequences:
    type: stdout
    format: edam:format_2572
Output parameters
● Specify name & type of output parameters
● In this example, capture the STDOUT stream from
“samtools sort” and tag it as being BAM formatted.
36. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Workflows
● Specify data dependencies between steps
● Scatter/gather on steps
● Can nest workflows in steps
● Still working on:
● Conditionals & looping
38. Adapted from Peter Amstutz’s presentation, licensed CC-BY-SA
Example: grep & count
class: Workflow
cwlVersion: v1.0
inputs:
  pattern: string
  infiles: File[]
outputs:
  outfile:
    type: File
    outputSource: wc/outfile     # connect output of "wc" to workflow output
requirements:
  - class: ScatterFeatureRequirement
steps:
  grep:
    run: grep.cwl                # tool to run
    in:
      pattern: pattern
      infile: infiles
    scatter: infile              # scatter over input array
    out: [outfile]
  wc:
    run: wc.cwl
    in:
      infiles: grep/outfile      # connect output of "grep" to input of "wc"
    out: [outfile]
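The grep-and-count dataflow can be mimicked in plain Python to show what scatter/gather means: "grep" runs once per input file, and "wc" gathers the whole array of grep outputs. This is a hypothetical in-memory sketch, not how a CWL engine executes the real grep.cwl and wc.cwl tools.

```python
# Stand-ins for the two tools: grep filters lines, wc counts them.
def grep(pattern, lines):
    return [line for line in lines if pattern in line]

def wc(list_of_line_lists):
    return sum(len(lines) for lines in list_of_line_lists)

# Two "files", each a list of lines (stand-in for File[] input)
infiles = [
    ["alpha", "beta", "gamma"],
    ["beta", "delta"],
]

# scatter: run grep once per element of the input array
grep_outputs = [grep("beta", f) for f in infiles]

# gather: wc consumes the whole array of grep outputs
total = wc(grep_outputs)
print(total)  # 2
```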
39. Accessing the
Open PHACTS Linked Data API
with KNIME
James A. Lumley
Research IT, Eli Lilly
April 2017
40. The KNIME Analytics Platform
Open source platform for data analytics. Over 1000 modules (or nodes) to connect to all major data
sources; support for many data types inc. XML/JSON/Images/Docs/Chemical Formats; Math and Stats
functions, Predictive modelling and machine learning; Tool blending for Python/R/Weka/SQL/Java;
Interactive data views and reporting. “a toolbox for any data scientist”.
https://www.knime.org/knime-analytics-platform
41. ♦ 2016 (VU Amsterdam)*
• Original Nodes and workflows by Ronald Siebes, VU Amsterdam
• OPS_Swagger and OPS_JSON nodes used to create and execute the
parameterized API calls, as well as transforming the output to a tabular form
♦ Q2 2017 (Eli Lilly)
• Update of Erl Wood KNIME Nodes will add a new OPS node developed internally
at Eli Lilly with input from OPS
– KNIME Node: Luke Bullard
– Team input: James Lumley / Derek Marren (Lilly); Daniela Digles / Nick Lynch (OPS);
Randy Kerber (d2discovery)
– Workflows: James Lumley
• Single Node allows user to select the call of interest and return both JSON and
Tabular results
• Focus of development: Updating to new API, improving usability
• Further iterations possible once feedback received
OPS-KNIME Nodes
* http://www.openphactsfoundation.org/wp/wp-content/uploads/2016/02/2016-02-25_Creating-workflows-for-drug-discovery-with-Open-PHACTS-and-KNIME.pdf
42. OPS & Erl Wood Community Nodes
♦ View based on internal Beta
of Lilly opensource Erl Wood
nodes due for release Q2
2017
♦ Community Erlwood Nodes
Open PHACTS
♦ Open PHACTS sub-folder
contains single OPS Linked
Data API node that will allow a
configured call/return
43. Configuring the OPS Linked Data API node
♦ Preferences panel allows client/workflow
level control of API URL Endpoint and API
Id/Key, avoiding the need to configure
these in the node
44. Using the OPS Linked Data API node
App Id and App Key fields are
automatically populated if they
are set in the preferences
Drop down ‘Select Method Type’
allows selection of API call
45. Using the OPS Linked Data API node
Input port is optional. Toggle
on input field allows user string
input or selection of input table
column
First output port returns
formatted data table
(corresponding to API param
“_format=tsv”)
46. Using the OPS Linked Data API node
Drop down ‘Select Method
Type’ allows selection of API
call
Logically grouped methods
match developer API docs
(swagger) at https://dev.openphacts.org/docs/2.1
47. Allows formatted results table or full
JSON/XML return for debug/analysis
First output port returns
formatted data table
(corresponding to API
param “_format=tsv”)
Second output port is
optional and if
requested, will return
JSON or XML response
(via second API call
without _format param)
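The difference between the node's two output ports comes down to the query string it builds: the first call appends "_format=tsv", the optional second call repeats the request without it. A sketch of that URL construction is below; the base URL and parameter names follow the slides (app_id, app_key, _format) but should be treated as illustrative assumptions, not a definitive client.

```python
# Sketch of the two OPS Linked Data API calls the node makes:
# one with "_format=tsv" (tabular port), one without (raw JSON/XML port).
from urllib.parse import urlencode

def build_call(base_url, method, params, app_id, app_key, tsv=True):
    # app_id/app_key come from the preferences panel; _format is optional
    query = dict(params, app_id=app_id, app_key=app_key)
    if tsv:
        query["_format"] = "tsv"
    return f"{base_url}/{method}?{urlencode(query)}"

base = "https://beta.openphacts.org/2.1"   # assumed endpoint, for illustration
tsv_url = build_call(base, "compound", {"uri": "http://example/cpd"}, "ID", "KEY")
raw_url = build_call(base, "compound", {"uri": "http://example/cpd"}, "ID", "KEY",
                     tsv=False)
print(tsv_url)
print(raw_url)
```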
49. User input and example return
Raw Tabular Return:
Pivoted to show Column Names and Values:
50. User input and example return
Optional JSON Output as raw JSON Object
51. User input and example return
Rather than parsing the JSON to
understand the raw output, the node also
has an attached ‘View’ with a hierarchically
formatted tree view of the JSON output:
52. User input and example return
Generic JSON Extraction to
flat table shows additional
data returned from API,
deeper JSON processing
can be done using KNIME
JSON nodes
53. JSON/XML Support in KNIME 3.3
Extensive native support for JSON or XML parsing with KNIME 3.3 allows
complete/custom parsing of the return JSON object for full debugging
54. Chemistry Support on input SMI
Input columns of differing
chemical types are
automatically converted to
SMILES via Marvin if the API
param is SMILES based
55. API Timeouts and URL changes
Advanced developers can
change the API timeout value or
edit the API URL on a single
node using the Web Service
panel
56. 1. A new KNIME 3.3 compatible “OpenPHACTS Linked Data API”
node will be released in Q2 2017
2. Designed for users, it provides easy configuration of API settings
and parameters with easy-to-use tabular data return (via the API
_format parameter)
3. Designed for developers it allows additional full JSON/XML
response that can be viewed/parsed by the expert user to see raw
response
4. Further example workflows will be released once the node is
available
Summary
59. List compounds active on target X
Open PHACTS + Pipeline Pilot Workflow:
1. Search target information
• [OPS API call ‘Free Text to Concept’]
2. Get active compounds on that target
• [OPS API call ‘Target Pharmacology: List’]
63. Find compounds against Alzheimer’s targets
Open PHACTS + Pipeline Pilot Workflow:
1. Search for disease
• [OPS API call ‘Free Text to Concept’]
2. Search target information
• [OPS API call ‘Targets for Disease: List’]
3. Get active compounds on that target
• [OPS API call ‘Target Pharmacology: List’]
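The three-step workflow above is just a chain of API calls, each feeding the next. A minimal Python sketch of that chaining is below; the stub functions return canned data to show the dataflow only — in the real workflow each stub would be the corresponding OPS API call named in the steps.

```python
# Sketch of chaining the three steps: free-text search for a disease,
# targets for that disease, then pharmacology per target. Stubs with
# canned data stand in for the real OPS API calls.

def free_text_to_concept(text):
    # stand-in for OPS API call 'Free Text to Concept'
    return {"Alzheimer's disease": "ops:disease/1"}.get(text)

def targets_for_disease(disease_uri):
    # stand-in for OPS API call 'Targets for Disease: List'
    return {"ops:disease/1": ["ops:target/APP", "ops:target/MAPT"]}.get(
        disease_uri, [])

def target_pharmacology(target_uri):
    # stand-in for OPS API call 'Target Pharmacology: List'
    canned = {
        "ops:target/APP": ["CHEMBL1", "CHEMBL2"],
        "ops:target/MAPT": ["CHEMBL3"],
    }
    return canned.get(target_uri, [])

disease = free_text_to_concept("Alzheimer's disease")
compounds = [c for t in targets_for_disease(disease)
             for c in target_pharmacology(t)]
print(compounds)  # ['CHEMBL1', 'CHEMBL2', 'CHEMBL3']
```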