This document discusses databases in bioinformatics. It begins by explaining that bioinformatics concerns the creation and maintenance of biological databases that allow researchers to access existing information and submit new entries. The aims of bioinformatics are to organize data, develop analysis tools, and use those tools to analyze data and interpret the results in a biologically meaningful way. Several important biological databases are described, including the nucleotide sequence databases hosted by NCBI and protein sequence databases. GenBank is discussed as the annotated collection of all publicly available DNA sequences. Biological databases make large datasets available to researchers and form an important part of the infrastructure of biological research.
Fabricio Silva: Cloud Computing Technologies for Genomic Big Data Analysis - Flávio Codeço Coelho
This document discusses the use of cloud computing technologies for genomic big data analysis. It begins by defining big data and describing the exponential growth of genomic data. It then discusses how cloud computing provides flexibility, scalability, and accessibility for genomic data processing through virtualization and large computing clusters. Specific cloud-enabled technologies that support genomic analysis are described, such as Hadoop, MapReduce, and genomic analysis tools adapted to these frameworks. The document concludes by discussing remaining challenges around data transfer speeds and the need for cloud application expertise, but also describes how platforms like Galaxy Cloudman and Cloudgene allow genomic analysis in the cloud without programming expertise.
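To make the MapReduce pattern named above concrete, here is a minimal, framework-free sketch of the two phases applied to a common genomic task, k-mer counting. The task and all names are illustrative assumptions rather than content of the summarized slides; on Hadoop, equivalent mapper and reducer scripts would run in parallel over distributed storage.

```python
# Map/reduce in miniature: count k-mers across sequencing reads.
from collections import defaultdict

def mapper(read: str, k: int = 4):
    """Map phase: emit (k-mer, 1) for every k-length window in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def reducer(pairs):
    """Reduce phase: sum the counts per key (after shuffling by k-mer)."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

reads = ["ACGTACGT", "CGTACGTA"]
pairs = (pair for read in reads for pair in mapper(read))
print(reducer(pairs))
```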
Bioinformatics databases aim to manage the complexity of life by integrating diverse biological data types. Relational databases use standardized identifiers and data formats to store sequence, expression, proteomic, and metabolomic data. Cross-referencing multiple databases through data warehousing and centralized schemas allows for functional querying of biological networks and neighborhoods. Future directions include greater use of machine learning, data mining, and global data standards.
This document discusses challenges to reproducibility in systems biology and potential solutions. It notes that a lack of data standards, quality, availability, and transparency makes it difficult for researchers to reproduce results. Tools and initiatives discussed that aim to improve reproducibility include the COMBINE archive to bundle necessary files, graph databases to integrate model-related data, and version control systems to track model evolution over time. The overall goal is to better support scientists in sharing reproducible model-based studies.
The document discusses three data retrieval tools - Entrez, DBGET, and SRS - that allow molecular biologists to search and access information across multiple linked databases. Entrez, developed by NCBI, integrates information from databases including GenBank, RefSeq, PDB, and PubMed. SRS, developed at EBI, is open-source software with a scripting language called Icarus; it integrates over 80 molecular biology databases, and the public SRS network indexes over 250 databases across more than 35 servers worldwide. It allows searching of sequence, structure, gene-related, and bibliographic databases through a uniform web interface.
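As an illustration of programmatic access to Entrez, here is a minimal sketch using Biopython's Bio.Entrez module, assuming Biopython is installed; the e-mail address and query term are placeholders.

```python
# Query NCBI Entrez for protein records via Biopython (pip install biopython).
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address on every request

handle = Entrez.esearch(
    db="protein",
    term="insulin[Protein Name] AND human[Organism]",  # Entrez web-style query
    retmax=5,
)
record = Entrez.read(handle)
handle.close()

print(record["IdList"])  # UIDs; pass them to Entrez.efetch to retrieve full records
```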
This document discusses biological databases. It defines biological databases as structured, searchable collections of biological data that are periodically updated and cross-referenced. It notes that biological databases store data electronically and serve to systematize data, make it available, and allow analysis of computed biological data. The document then describes some key features of biological databases, including data heterogeneity, high data volumes, uncertainty, data curation, integration, sharing, and dynamic nature. It also provides examples of different types of biological databases classified by data type, maintainer, access, source, design, and organism covered.
Slides from the presentation at IDAMO 2016, Rostock. May 2016.
Most scientific discoveries rely on previous or other findings. A lack of transparency and openness led to what many consider the "reproducibility crisis" in systems biology and systems medicine. The crisis arose from missing standards and inappropriate support of standards in software tools. As a consequence, numerous results in low- and high-profile publications cannot be reproduced.
In my presentation, I summarise key challenges of reproducibility in systems biology and systems medicine, and I demonstrate available solutions to the related problems.
This document describes several text-based biological databases and how to search them. It discusses Entrez, which searches multiple databases and links related entries. It also describes the Sequence Retrieval System (SRS) which allows searching over 80 biological databases. Additionally, it outlines DBGET/LinkDB, an integrated system that searches about 20 databases and links results to associated information. The document provides an example of using each system to retrieve information on a specific protein entry.
This document discusses databases in bioinformatics. It begins by noting the rapid increase in biological data from sources like gene sequences, protein sequences, structural data, and gene expression data. It then defines biological databases as structured, searchable collections of data that are periodically updated and cross-referenced. The major purposes of databases are to make biological data available, systematize the data, and allow analysis of computed biological data. The document provides a brief history of biological databases and sequencing efforts. It also classifies biological databases based on data type, maintenance status, data access, data sources, database design, and organism. Specific databases discussed include DDBJ, EMBL, GenBank, Swiss-Prot, and NCBI.
This document discusses leveraging graph data structures to analyze variant data and related annotations from large genomic datasets. In phase I, simple queries on a graph database performed as well as or better than on a relational database. Complex queries exploring patterns and clusters were also possible. In phase II, spectral clustering of 1000 Genomes data identified three main clusters supporting known population genetics patterns, demonstrating the potential of graph databases for mining complex genomic correlations. The results indicate a graph database provides an effective approach for precision cancer research by enabling both known and novel queries on large genomic datasets.
This document discusses leveraging graph data structures to analyze variant data and related annotations from large genomic datasets in a scalable way. An in-memory graph database was used to model variants, annotations, and their relationships. Simple queries on the graph performed as well as or better than on a relational database. More complex queries and analyses, like spectral clustering of populations, were also possible with the graph model and helped identify patterns not feasible with relational approaches. The results indicate graph databases are a powerful tool for precision medicine research by enabling both known and novel analyses of large genomic datasets.
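To illustrate the kind of query both summaries describe, here is a hedged sketch using the Neo4j Python driver and Cypher; Neo4j stands in for whichever graph engine the studies used, and the Variant/Annotation/Gene schema is entirely hypothetical.

```python
# Find variants that share an annotation with a given gene (hypothetical schema).
# Assumes a local Neo4j instance and the official driver (pip install neo4j).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (v:Variant)-[:ANNOTATED_WITH]->(a:Annotation)
      <-[:ANNOTATED_WITH]-(g:Gene {symbol: $symbol})
RETURN v.id AS variant, a.name AS annotation
LIMIT 25
"""

with driver.session() as session:
    for row in session.run(QUERY, symbol="TP53"):
        print(row["variant"], row["annotation"])

driver.close()
```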
This document discusses data and model management in systems biology. It covers topics such as data ownership, metadata, ontologies, standards for encoding models and analyses, and tools for working with systems biology models and data. Standards like SBML, SBGN, SED-ML and COMBINE Archive allow for structured representation, visualization, simulation, and sharing of models and data. Resources like SEEK enable curation, simulation and publication of models in a findable, accessible, interoperable and reusable (FAIR) manner.
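As a sketch of the COMBINE Archive idea (a zip that bundles a model, a simulation description, and a manifest listing each file's format), the following uses only the Python standard library. The manifest is abridged and the bundled files are empty stand-ins, so treat this as an illustration rather than a spec-complete OMEX writer.

```python
# Bundle stand-in files into a COMBINE-archive-style zip with a manifest.
import zipfile

MANIFEST = """<?xml version="1.0" encoding="UTF-8"?>
<omexManifest xmlns="http://identifiers.org/combine.specifications/omex-manifest">
  <content location="." format="http://identifiers.org/combine.specifications/omex"/>
  <content location="model.xml" format="http://identifiers.org/combine.specifications/sbml"/>
  <content location="simulation.sedml" format="http://identifiers.org/combine.specifications/sed-ml"/>
</omexManifest>
"""

with zipfile.ZipFile("study.omex", "w") as omex:
    omex.writestr("manifest.xml", MANIFEST)          # required index of the archive
    omex.writestr("model.xml", "<sbml/>")            # stand-in SBML model
    omex.writestr("simulation.sedml", "<sedML/>")    # stand-in SED-ML description
```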
This document provides information on biological databases, including their history, features, and classifications. It notes that the first protein sequenced was insulin in 1965, and the first genome sequenced was of a virus in 1995. Key features of biological databases discussed include their heterogeneity, high volume of data, uncertainty, data curation, integration, sharing, and dynamic nature as new data is added. Biological databases can be classified by data type, maintainer status, data access, source, design, and organism covered. The purpose of biological databases is to systematically organize and make available vast amounts of complex biological data.
Science is rapidly being brought into the electronic realm, and electronic laboratory notebooks (ELNs) are a big part of this activity. The representation of the scientific process in the context of an ELN is an important component of making the data recorded in ELNs semantically integrated.
This presentation outlined initial developments of an Electronic Notebook Ontology (ENO) that will help tie together the ExptML ontology, HCLS Community Profile data descriptions, and the VIVO-ISF ontology.
The document summarizes Anita de Waard's presentation on Elsevier's experiments with big and small data. It discusses Elsevier's work with text mining and knowledge graphs to extract information from over 14 million articles. It also describes Elsevier's Medical Graph which predicts the probability of over 2,000 medical conditions occurring based on analysis of clinical data from 6 million patients. Finally, it reviews Elsevier's various tools and services to help researchers preserve, process, share, comprehend, access, and discover research data and publications.
Research data management (RDM) and the FAIR principles (Findable, Accessible, Interoperable, Reusable) are widely promoted as the basis for a shared research data infrastructure. Nevertheless, researchers involved in next generation sequencing (NGS) still lack adequate RDM solutions. NGS metadata is generally not stored together with the raw NGS data, but kept by individual researchers in separate files, which complicates RDM practice. Moreover, the (meta)data often does not meet the FAIR principles [6]. Consequently, a central FAIR-compliant repository is highly desirable to support NGS-related research. We have selected iRODS (integrated Rule-Oriented Data System) [3] as the basis for implementing a sequencing data repository because it allows storing data and metadata together. iRODS serves as scalable middleware that gives centralized, virtualized access to different storage facilities and supports different types of clients. This repository will be part of an ecosystem of RDM solutions that cover complementary phases of the research data life cycle in our organization (Academic Medical Center of the University of Amsterdam). We selected Virtuoso [5] to enrich the metadata from iRODS and to manage a triplestore for linked data. The metadata in the iCAT (the iRODS metadata catalogue) and the ontology in Virtuoso are kept synchronized by enforcing strict data manipulation policies.
We have implemented a prototype to preserve raw sequencing data for one research group. Three iRODS client interfaces are used for different purposes: Davrods [4] for data and metadata ingestion and data retrieval; Metalnx-web [7] for administration, data curation, and repository browsing; and iCommands [2] for all tasks by advanced users. Different user profiles (principal investigator, data curator, repository administrator) are defined, each with different access rights. New data is ingested by copying raw sequence files and the corresponding metadata file (a sample sheet) to the landing collection on iRODS. An iRODS rule triggered by the sample sheet file extracts the metadata and registers it in the iCAT as AVUs (Attribute, Value, Unit triples). Ontology files are registered in Virtuoso. The sequence files are copied to the persistent collection and made uniquely identifiable based on their metadata. All steps are recorded in a report file that enables monitoring and tracking of progress and faults. Here we describe the design and implementation of the prototype and discuss the first assessment results, which indicate that the proposed solution is acceptable and fits the researchers' workflow well.
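A minimal sketch of the metadata-extraction step just described: turning one row of a hypothetical sample sheet into the (Attribute, Value, Unit) triples that the iRODS rule registers in the iCAT. The column names and unit lookup are illustrative assumptions, not the prototype's actual schema.

```python
# Parse one row of a made-up sample sheet into AVU triples.
# In the prototype this extraction is performed by an iRODS rule, and the
# resulting AVUs are attached to the data object in the iCAT.
import csv
import io

sample_sheet = io.StringIO(
    "sample_id,organism,read_length,platform\n"
    "S001,Homo sapiens,150,Illumina NovaSeq\n"
)

UNITS = {"read_length": "bp"}  # columns that carry a unit (assumption)

for row in csv.DictReader(sample_sheet):
    avus = [(attr, value, UNITS.get(attr, "")) for attr, value in row.items()]
    for avu in avus:
        print(avu)  # e.g. ('read_length', '150', 'bp')
```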
This document discusses data management requirements for predictive modeling using large datasets from multiple clinical, specimen, and lab repositories. It notes the need to assemble complete and up-to-date datasets while maintaining quality assurance and transparency. Over time, data storage systems experience problems with exponential data growth, manual data curation difficulties, and challenges integrating heterogeneous databases across different research groups. The document examines a spectrum of potential data management approaches and highlights collaborative networks and use of open source platforms as ways to address these issues.
Claudia Medina: Linking Health Records for Population Health Research in Brazil - Flávio Codeço Coelho
The document discusses record linkage, which is the process of identifying and merging records from different databases that refer to the same individual. It describes common record linkage approaches used in Brazil's health sector, including probabilistic and deterministic methods. It also evaluates the accuracy of applying a probabilistic record linkage strategy to identify deaths among AIDS cases reported to Brazil's surveillance database, finding a sensitivity of 87.6% and specificity of 99.6%. Finally, it discusses the potential impact of linkage errors on risk ratio estimates in longitudinal mortality studies.
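For reference, the two accuracy measures quoted above are simple functions of the linkage confusion matrix: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below computes them from made-up counts chosen only to reproduce the quoted rates; these are not the study's data.

```python
# Sensitivity and specificity of a record linkage strategy, computed from a
# confusion matrix of linked (positive) and non-linked (negative) record pairs.
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true matches the linkage actually found."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Fraction of true non-matches the linkage correctly rejected."""
    return tn / (tn + fp)

# Hypothetical counts chosen to reproduce the reported 87.6% / 99.6%.
tp, fn, tn, fp = 876, 124, 9960, 40
print(f"sensitivity = {sensitivity(tp, fn):.1%}")  # 87.6%
print(f"specificity = {specificity(tn, fp):.1%}")  # 99.6%
```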
Introduction to the hands-on session on "Standards and tools for model management" at the ICSB 2015.
Focus on COMBINE standards and on tools for search, version control, and archiving. The management platform used is SEEK.
This document discusses SED-ML (Simulation Experiment Description Markup Language), a standard for describing computational simulations. SED-ML files contain information like the models, data, simulation settings and algorithms used in an experiment. Using SED-ML allows experiments to be reproduced and shared. The document encourages adopting SED-ML to make research more reproducible and help curation of models in repositories. It also provides an overview of tools that support SED-ML and ways to get involved in its development.
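A schematic example of the ingredients a SED-ML file captures is sketched below: a model reference, a time-course simulation with its algorithm (identified by a KiSAO term), and a task binding the two. This is an abridged illustration based on the general shape of SED-ML Level 1 documents; consult the SED-ML specification for the normative schema.

```python
# Write an abridged, illustrative SED-ML-style document. Attribute details
# are simplified; real files are usually produced by SED-ML-aware tools.
sedml = """<?xml version="1.0" encoding="UTF-8"?>
<sedML xmlns="http://sed-ml.org/sed-ml/level1/version3" level="1" version="3">
  <listOfModels>
    <model id="model1" language="urn:sedml:language:sbml" source="model.xml"/>
  </listOfModels>
  <listOfSimulations>
    <uniformTimeCourse id="sim1" initialTime="0" outputStartTime="0"
                       outputEndTime="100" numberOfPoints="1000">
      <algorithm kisaoID="KISAO:0000019"/><!-- KISAO:0000019 = CVODE -->
    </uniformTimeCourse>
  </listOfSimulations>
  <listOfTasks>
    <task id="task1" modelReference="model1" simulationReference="sim1"/>
  </listOfTasks>
</sedML>
"""

with open("simulation.sedml", "w", encoding="utf-8") as fh:
    fh.write(sedml)
```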
Reproducibility and Scientific Research: why, what, where, when, who, how - Carole Goble
This document discusses the importance of reproducibility in scientific research. It makes three key points:
1. For results to be considered valid, scientific publications should provide clear descriptions of methods and protocols so that other researchers can successfully repeat and extend the work.
2. Many factors can undermine reproducibility, such as publication pressures, poor training, disorganization, and outright fraud. Ensuring reproducible research requires transparency across experimental designs, data, software, and computational workflows.
3. Achieving reproducible science is challenging and poorly incentivized due to the resources and time required to prepare materials for independent verification. Overcoming these issues will require collective effort across the research community.
The document discusses different text-based database retrieval systems for accessing biological data, including Entrez, SRS, and DBGET/LinkDB. It describes their key features and how each system allows users to search text databases using queries, with Entrez providing linked related data across multiple databases. An example shows how each system can be used to retrieve and view related information for a SwissProt protein entry.
The document discusses biological databases and retrieval systems. It provides an overview of Entrez, a retrieval system developed by NCBI that allows integrated searches across multiple biological databases. It also describes how Entrez links related data between databases, and some key features of Entrez like limits, preview/index, and history. Additionally, it summarizes specific NCBI databases accessible through Entrez like PubMed and OMIM, as well as another retrieval system called SRS maintained by EBI.
GenBank, EMBL, and DDBJ are primary nucleotide sequence databases that collaborate to store publicly available DNA sequences. NCBI's GenBank is one of the largest primary sequence databases, containing sequences from over 240,000 organisms submitted by laboratories. PubMed and Entrez are NCBI's biomedical literature database and integrated retrieval system, which allow users to search biomedical research articles and to pull together related data from multiple sources. SRS is a sequence retrieval system developed by EBI that integrates over 250 molecular biology databases and allows complex queries across data sources.
Being FAIR: Enabling Reproducible Data Science - Carole Goble
Talk presented at the Early Detection of Cancer Conference, OHSU, Portland, Oregon, USA, 2-4 Oct 2018 (http://earlydetectionresearch.com/), in the Data Science session.
Bioinformatics databases: Current Trends and Future Perspectives - University of Malaya
Data is the most powerful resource in any field or subject of study. In biology, data comes from scientists and their actions, and any institution that makes sense of the data it collects will be at the forefront of its research field. At the beginning of any data collection endeavour, it is critical to find proper management techniques to store data and to maximise its utilisation. This presentation reflects on current trends and techniques in data modeling and architecture, with a highlight on the uses of databases, focusing on bioinformatics examples and case studies. Finally, the future of bioinformatics databases is discussed to give an overview of the modeling techniques that will accommodate the escalation of biological data in the coming years.
1. The document discusses how a biologist, Marco Roos, became interested in e-science through his work in molecular and cellular biology, bioinformatics, and data integration projects.
2. Roos describes how e-science allows for collaboration between different experts and disciplines through technologies like workflows, semantic web, and virtual laboratories.
3. Roos emphasizes that e-science should empower scientists by making tools and resources easy to use, share, and build upon so that scientists can focus on scientific problems rather than technical challenges.
Dynamic Semantic Metadata in Biomedical Communications - Tim Clark
1) The document discusses challenges in curing complex medical disorders and proposes that semantic annotation, hypothesis management, and nanopublications can help address these challenges by enabling improved information sharing and integration across research communities.
2) It describes various technologies and frameworks like the Annotation Ontology, SWAN Annotation Framework, and nanopublications that can help researchers semantically annotate documents, manage hypotheses, and publish and share interpretations.
3) International collaborations between researchers and informaticians are seen as important to building the information ecosystem needed to make progress on curing complex diseases.
This document discusses using semantic web technologies for translational research in life sciences. It provides an overview of semantic web standards and outlines several projects demonstrating applications in healthcare and biomedical research. These include developing an active semantic electronic medical record, semantically annotating experimental glycomics data, and integrating diverse biomedical data sources using ontologies to enable complex querying and knowledge discovery.
Semantic Web for Health Care and Biomedical Informatics - Amit Sheth
Amit Sheth, "Semantic Web for Health Care and Biomedical Informatics," Keynote at NSF Biomed Web Workshop, Corbett, Oregon, December 4-5, 2007.
http://www.biomedweb.info/2007/
The document discusses the increasing scale and complexity of knowledge generation in science domains like astronomy and medicine over recent centuries. It argues that knowledge generation can be viewed as a systems problem involving many actors and processes. The document proposes a service-oriented approach using web services as an integrating framework to address challenges of scale, complexity, and distributed collaboration in e-Science. Key challenges discussed include semantics, documentation, scaling issues, and sociological factors like incentives.
Investigating plant systems using data integration and network analysis - Catherine Canevet
The document discusses challenges in integrating plant data from multiple sources and proposes solutions. It notes that plant data is sparse, distributed across many databases in various formats, and focused primarily on the model plant Arabidopsis. Data integration is necessary to address key biological questions by consolidating information from pathway databases, gene annotations, protein interactions, and more. The document outlines approaches to data integration including controlled vocabularies, ontologies, data standards, and integration applications specifically designed to combine data sources like Ondex. Effective integration is important to fully leverage available plant data.
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks - Carole Goble
Keynote presentation at the iConference 2015, Newport Beach, California, 26 March 2015.
Results Vary: The Pragmatics of Reproducibility and Research Object Frameworks
http://ischools.org/the-iconference/
BEWARE: presentation includes hidden slides AND in situ build animations - best viewed by downloading.
The document summarizes the experience of a biologist in adopting an e-science approach to their work. It describes how before e-science, the biologist took an uncoordinated "spaghetti" approach using various tools without a unified strategy. The biologist then explains how adopting e-science principles like collaboration, reusable workflows, and web services helped enhance their work by allowing experts from different domains to combine their expertise. The biologist also reflects on outreach efforts to promote e-science to other researchers.
Data analysis & integration challenges in genomics - mikaelhuss
Presentation given at the Genomics Today and Tomorrow event in Uppsala, Sweden, 19 March 2015. (http://connectuppsala.se/events/genomics-today-and-tomorrow/) Topics include APIs, "querying by data set", machine learning.
The document discusses the growth of data-intensive science and the need for new computing infrastructures to manage the large amounts of data being produced. It covers three perspectives on infrastructure: grid computing which enables sharing of distributed resources over the internet, data centers which provide integrated storage and computing services, and e-science which combines grids, collaboration tools, and data analysis services. Examples are given of different scientific domains using these infrastructures.
Bioinformatics is an interdisciplinary field that combines biology, computer science, and information technology. It involves the electronic storage, retrieval, analysis, and correlation of biological data. The document outlines key concepts in bioinformatics including the central dogma of molecular biology, biological data representation, how computers can be useful for biology, challenges in the field, and examples of intelligent bioinformatics applications. It emphasizes that bioinformatics is an important and growing field at the intersection of biology and computer science.
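Because the summary names the central dogma as a key concept, a toy sketch of it as pure data transformation may help: DNA is transcribed to mRNA, which is translated codon by codon into protein. The codon table is truncated to the few codons this example needs.

```python
# Central dogma as data transformation: DNA -> mRNA -> protein.
CODON_TABLE = {"AUG": "M", "UUU": "F", "GGC": "G", "UAA": "*"}  # '*' = stop

def transcribe(dna: str) -> str:
    """Transcription: the mRNA copy replaces thymine (T) with uracil (U)."""
    return dna.replace("T", "U")

def translate(mrna: str) -> str:
    """Translation: read three-letter codons until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate(transcribe("ATGTTTGGCTAA")))  # -> MFG
```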
The document discusses the ISA infrastructure, which provides a standardized format (ISA-TAB) for experimental metadata and data exchange. It can be used across various domains like toxicology, systems biology, and nanotechnology. The Risa R package integrates experimental metadata with analysis and allows updating metadata. Nature Scientific Data is a new publication for describing valuable datasets. The ISA framework has been adopted by over 30 public and private resources and is growing in use for facilitating reuse of investigations in various life science domains. Toxicity examples include EU projects on predictive toxicology and a rat study of drug candidates. Questions can be directed to the ISA tools group.
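To give a feel for the tabular side of the format, here is a sketch that reads a hypothetical, heavily abridged ISA-TAB study sample table. Real investigations span investigation, study, and assay files, and dedicated parsers such as the isatools package handle the full format; the columns below are illustrative assumptions.

```python
# Read one row of a made-up, abridged ISA-TAB-style study sample table.
# ISA-TAB files are tab-delimited, so csv.DictReader with delimiter='\t' works
# for this toy case; use a dedicated ISA parser for real investigations.
import csv
import io

study_file = io.StringIO(
    "Source Name\tCharacteristics[organism]\tSample Name\n"
    "source1\tRattus norvegicus\tsample1\n"
)

for row in csv.DictReader(study_file, delimiter="\t"):
    print(row["Sample Name"], "derived from", row["Characteristics[organism]"])
```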
Scott Edmunds' slides for class 8 of the HKU Data Curation course (module MLIM7350, Faculty of Education), covering science data, medical data and ethics, and the FAIR data principles.
This document summarizes Professor Carole Goble's presentation on making research more reproducible and FAIR (Findable, Accessible, Interoperable, Reusable) through the use of research objects and related standards and infrastructure. It discusses challenges to reproducibility in computational research and proposes bundling datasets, workflows, software and other research products into standardized research objects that can be cited and shared to help address these challenges.
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o... - Carole Goble
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle if not practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics view point, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from one where results are post-hoc "made reproducible", to pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
Keynote presented at the Phenotype Foundation first annual meeting.
Describes data sharing, data annotation, and the need for further development of tools, ontologies, and ontology mappings.
Amsterdam, January 18, 2016
Opening up pharmacological space, the Open PHACTS API - Chris Evelo
The document provides an overview of the Open PHACTS project, which aims to create an open pharmacological space (OPS) through semantic integration of public drug discovery resources. It discusses the challenges of accessing and integrating scientific data across organizational boundaries. Open PHACTS builds a service layer and applications to allow standardized access and analysis of data from various public sources. It is a collaborative project involving academic and industry partners seeking to make pre-competitive drug discovery data more accessible and useful through semantic integration and common standards.
WikiPathways: how open source and open data can make omics technology more us... - Chris Evelo
This document discusses WikiPathways, an open source pathway database. It began in 2007 with the goals of having an online platform by March 2007 and gaining a first unknown user by January 2008, both of which were successes. WikiPathways has grown significantly since, now containing over 400 human pathways and 6,200 unique human genes. It receives over 1 million pageviews annually. The document advocates for opening up data and code to make omics technology more useful. It describes WikiPathways' various features including its BioPAX format, REST services, and integration with Cytoscape. It also discusses professionalizing open source and collaborating with existing communities and tools rather than trying to change the world alone.
A real life example to show the added value of the Phenotype Database (dbNP)... - Chris Evelo
NuGO has initiated the development of the Phenotype Database (dbNP). This database is developed together with several other consortia (e.g. Netherlands Metabolomics Centre) and is currently used within several European projects, such as Food4me, NU-AGE, Bioclaims and Nutritech.
The Phenotype Database (www.dbnp.org) is a web-based application/database that can store any biological study. We used this application to perform an analysis on a combination of several studies, with the objective of testing whether it is possible to answer new research questions using a ‘virtual cohort’.
Study comparison:
The assessment of the health status of an individual is an important but challenging issue. Nowadays, challenge tests are proposed as a method to assess and quantify health status. We would like to find mechanistic explanations for differences in clinical subgroups and to develop a metabolomics-platform-based fingerprint at baseline that represents important parameters of the challenge test. Currently, there is not one single study available that includes enough subjects from specific clinical subgroups to develop such a fingerprint or to study the biological processes specific to those subgroups. Therefore, we developed a toolbox that facilitates the combined analysis of multiple studies.
Presentation on pathway extensions using knowledge integration and network approaches, given at the Systems Biology Institute in Luxembourg on November 28, 2012.
Using ontologies to do integrative systems biology - Chris Evelo
The document discusses using ontologies to integrate systems biology data. It describes typical steps in systems biology studies such as finding studies, processing data, integrating data, and combining data from multiple sources. Ontologies can help link information from different analysis techniques and combine data from many studies by capturing study metadata. The document advocates using standards like ISA-TAB and MAGE-TAB to capture study data and proposes using a generic study capture framework with modular components to integrate different types of 'omics data. Ontologies are needed for collaboration and to provide controlled vocabularies for annotation.
Using biological network approaches for dynamic extension of micronutrient re... - Chris Evelo
This document discusses using biological network approaches to dynamically extend pathways with regulatory information such as microRNAs (miRNAs). It describes tools like PathVisio that can integrate gene expression, proteomics and metabolomics data onto pathways to identify significantly changed processes. WikiPathways is introduced as a public pathway resource that can be contributed to and curated by researchers. The document outlines approaches for visualizing regulatory interactions on pathways using plugins, exploring pathway interactions through network analysis, and integrating other data types such as SNPs, fluxes and gene annotations to build a more comprehensive understanding of biological systems.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a... - Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich in features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and their quality of enabling complex behavior compounded from discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
Phenomics assisted breeding in crop improvement - IshaGoswami9
The global population is increasing and will reach about 9 billion by 2050; due to climate change, it will be difficult to meet the food requirements of such a large population. Facing the challenges presented by resource shortages, climate change, and an increasing global population, crop yield and quality need to be improved in a sustainable way over the coming decades. Genetic improvement by breeding is the best way to increase crop productivity. With the rapid progression of functional genomics, an increasing number of crop genomes have been sequenced and dozens of genes influencing key agronomic traits have been identified. However, current genome sequence information has not been adequately exploited for understanding the complex characteristics shaped by multiple genes, owing to a lack of crop phenotypic data. Efficient, automatic, and accurate technologies and platforms that can capture phenotypic data linkable to genomics information at all growth stages have become as important as genotyping. Thus, high-throughput phenotyping has become the major bottleneck restricting crop breeding. Plant phenomics has been defined as the high-throughput, accurate acquisition and analysis of multi-dimensional phenotypes during crop growing stages at the organism level, including the cell, tissue, organ, individual plant, plot, and field levels. With the rapid development of novel sensors, imaging technology, and analysis methods, numerous infrastructure platforms have been developed for phenotyping.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz), I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long-standing, and ongoing, scientific development as an exemplar. And so, I chose the ever-evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of more than 200 years, Thermodynamics R&D and application benefited from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at both micro and macro levels.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science, engineering, and technology, spanning micro-tech to aerospace and cosmology. I can think of no better story to illustrate the breadth of scientific methodologies and applications at their best.
The ability to recreate computational results with minimal effort and actionable metrics provides a solid foundation for scientific research and software development. When people can replicate an analysis at the touch of a button using open-source software, open data, and methods to assess and compare proposals, it significantly eases verification of results, engagement with a diverse range of contributors, and progress. However, we have yet to fully achieve this; there are still many sociotechnical frictions.
Inspired by David Donoho's vision, this talk aims to revisit the three crucial pillars of frictionless reproducibility (data sharing, code sharing, and competitive challenges) from the perspective of deep software variability.
Our observation is that multiple layers — hardware, operating systems, third-party libraries, software versions, input data, compile-time options, and parameters — are subject to variability that exacerbates frictions but is also essential for achieving robust, generalizable results and fostering innovation. I will first review the literature, providing evidence of how the complex variability interactions across these layers affect qualitative and quantitative software properties, thereby complicating the reproduction and replication of scientific studies in various fields.
I will then present some software engineering and AI techniques that can support the strategic exploration of variability spaces. These include the use of abstractions and models (e.g., feature models), sampling strategies (e.g., uniform, random), cost-effective measurements (e.g., incremental build of software configurations), and dimensionality reduction methods (e.g., transfer learning, feature selection, software debloating).
I will finally argue that deep variability is both the problem and the solution of frictionless reproducibility, calling on the software science community to develop new methods and tools to manage variability and foster reproducibility in software systems.
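One lever mentioned above, sampling strategies over a configuration space, can be sketched in a few lines. The option names and the notion of benchmarking each sampled variant are assumptions for illustration, not drawn from the talk.

```python
# Uniform random sampling of a (tiny) configuration space before measurement.
import itertools
import random

options = {
    "compiler_opt": ["-O0", "-O2", "-O3"],
    "lto": [True, False],
    "threads": [1, 4, 8],
}

# Enumerate all 3 * 2 * 3 = 18 configurations of the variability space.
space = [dict(zip(options, combo)) for combo in itertools.product(*options.values())]

random.seed(42)                     # fixed seed so the sampling is itself reproducible
sample = random.sample(space, k=5)  # measure only 5 of the 18 variants

for config in sample:
    print(config)  # each sampled variant would then be built and benchmarked
```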
Invited talk at the Journées Nationales du GDR GPL 2024.
The binding of cosmological structures by massless topological defects - Sérgio Sacani
Assuming spherical symmetry and weak field, it is shown that if one solves the Poisson equation or the Einstein field equations sourced by a topological defect, i.e. a singularity of a very specific form, the result is a localized gravitational field capable of driving flat rotation (i.e. Keplerian circular orbits at a constant speed for all radii) of test masses on a thin spherical shell without any underlying mass. Moreover, a large-scale structure which exploits this solution by assembling concentrically a number of such topological defects can establish a flat stellar or galactic rotation curve, and can also deflect light in the same manner as an equipotential (isothermal) sphere. Thus, the need for dark matter or a modified gravity theory is mitigated, at least in part.
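For context, the flat-rotation condition the abstract invokes follows from the standard circular-orbit relation in a spherically symmetric potential; the short derivation below is textbook material, not taken from the paper itself.

```latex
% Circular orbits in a spherically symmetric potential \Phi(r) satisfy
%   v^2(r) = r \, d\Phi/dr.
% Imposing a flat rotation curve, v(r) = v_0 for all r, gives
\[
  v_0^2 = r\,\frac{d\Phi}{dr}
  \quad\Longrightarrow\quad
  \Phi(r) = v_0^2 \,\ln\!\left(\frac{r}{r_0}\right),
\]
% the logarithmic potential of an equipotential (isothermal) sphere, which is
% why such a mass-free construction can also deflect light like one.
```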
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige... - University of Maribor
Slides from talk:
Aleš Zamuda: Remote Sensing and Computational, Evolutionary, Supercomputing, and Intelligent Systems.
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Inter-Society Networking Panel GRSS/MTT-S/CIS Panel Session: Promoting Connection and Cooperation
https://www.etran.rs/2024/en/home-english/
The debris of the ‘last major merger’ is dynamically young - Sérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the ‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space, because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia DR3 have positive caustic velocities, making them fundamentally different from the phase-mixed chevrons found in simulations at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based on a simple phase-mixing model, the observed number of caustics is consistent with a merger that occurred 1–2 Gyr ago. We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data 1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’ did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within the last few Gyr, consistent with the body of work surrounding the VRM.
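For readers unfamiliar with why counting caustics dates a merger, a schematic version of the phase-mixing argument (our addition, not the paper's exact model) is:

```latex
% Debris on a radial orbit wraps up in (r, v_r) phase space roughly
% once per radial period T_r, so the number of folds (caustics) grows
% approximately linearly with the time t since infall:
\[
  N_{\mathrm{folds}}(t) \;\approx\; \frac{t}{T_r}.
\]
% With T_r of order a few hundred Myr in the inner halo, a handful of
% observed caustics points to t of order 1-2 Gyr rather than 8-11 Gyr.
```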
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx - MAGOTI ERNEST
Although Artemia has been known to man for centuries, its use as a food for the culture of larval organisms apparently began only in the 1930s, when several investigators found that it made an excellent food for newly hatched fish larvae (Litvinenko et al., 2023). As aquaculture developed in the 1960s and ‘70s, the use of Artemia also became more widespread, due both to its convenience and to its nutritional value for larval organisms (Arenas-Pardo et al., 2024). The fact that Artemia dormant cysts can be stored for long periods in cans, and then used as an off-the-shelf food requiring only 24 h of incubation, makes them the most convenient, least labor-intensive live food available for aquaculture (Sorgeloos & Roubach, 2021). The nutritional value of Artemia, especially for marine organisms, is not constant, but varies both geographically and temporally. During the last decade, however, both the causes of Artemia nutritional variability and methods to improve poor-quality Artemia have been identified (Loufi et al., 2024).
Brine shrimp (Artemia spp.) are used in marine aquaculture worldwide. Annually, more than 2,000 metric tons of dry cysts are used for cultivation of fish, crustacean, and shellfish larvae. Brine shrimp are important to aquaculture because newly hatched brine shrimp nauplii (larvae) provide a food source for many fish fry (Mozanzadeh et al., 2021). Culture and harvesting of brine shrimp eggs represent another aspect of the aquaculture industry. Nauplii and metanauplii of Artemia, commonly known as brine shrimp, play a crucial role in aquaculture due to their nutritional value and suitability as live feed for many aquatic species, particularly in larval stages (Sorgeloos & Roubach, 2021).
Nucleophilic Addition of carbonyl compounds.pptx - SSR02
Nucleophilic addition is the most important reaction of carbonyls, not just of aldehydes and ketones but also of carboxylic acid derivatives in general.
Carbonyls undergo addition reactions with a large range of nucleophiles.
Comparing the relative basicity of the nucleophile and the product is extremely helpful in determining how reversible the addition reaction is. Reactions with Grignards and hydrides are irreversible. Reactions with weak bases like halides and carboxylates generally don’t happen.
Electronic effects (inductive effects, electron donation) have a large impact on reactivity.
Large groups adjacent to the carbonyl will slow the rate of reaction.
Neutral nucleophiles can also add to carbonyls, although their additions are generally slower and more reversible. Acid catalysis is sometimes employed to increase the rate of addition.
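The basicity argument above can be summarized in one equilibrium (a textbook-style sketch added here for clarity, not taken from the slides):

```latex
% Nucleophilic addition to a carbonyl and its reverse:
\[
  \mathrm{R_2C{=}O} \;+\; \mathrm{Nu^-}
  \;\rightleftharpoons\;
  \mathrm{R_2C(Nu)O^-}
\]
% The addition is effectively irreversible when Nu^- is a far stronger
% base than the alkoxide product (e.g. hydride, Grignard reagents),
% and does not proceed when Nu^- is a much weaker base (e.g. halides,
% carboxylates), matching the reactivity trends listed above.
```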
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomatic index - AbdullaAlAsif1
The pygmy halfbeak, Dermogenys colletei, is known for its viviparous nature and presents an intriguing case of relatively low fecundity, raising questions about potential compensatory reproductive strategies employed by this species. Our study examines fecundity and the Gonadosomatic Index (GSI) in the pygmy halfbeak, D. colletei (Meisner, 2001), an intriguing viviparous fish indigenous to Sarawak, Borneo. We hypothesize that the pygmy halfbeak, D. colletei, may exhibit unique reproductive adaptations to offset its low fecundity, thus enhancing its survival and fitness. To address this, we conducted a comprehensive study utilizing 28 mature female specimens of D. colletei, carefully measuring fecundity and GSI to shed light on the reproductive adaptations of this species. Our findings reveal that D. colletei indeed exhibits low fecundity, with a mean of 16.76 ± 2.01, and a mean GSI of 12.83 ± 1.27, providing crucial insights into the reproductive mechanisms at play in this species. These results underscore the existence of unique reproductive strategies in D. colletei, enabling its adaptation and persistence in Borneo's diverse aquatic ecosystems, and call for further ecological research to elucidate these mechanisms. This study contributes to a better understanding of viviparous fish in Borneo and to the broader field of aquatic ecology, enhancing our knowledge of species adaptations to unique ecological challenges.
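For reference, the Gonadosomatic Index reported above is conventionally computed as follows (the standard definition, not restated in the abstract):

```latex
\[
  \mathrm{GSI} \;=\; \frac{W_{\mathrm{gonad}}}{W_{\mathrm{body}}} \times 100
\]
% so a mean GSI of 12.83 means the gonads account for roughly 13% of
% body weight on average across the 28 females examined.
```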
4. Integrative Systems Biology
(Slide diagram: internal & external data repositories, e.g. dbNP, Sage, Atlas; knowledge resources & (semantic web) integration, e.g. Open PHACTS, WikiPathways; study capturing with ISA models; study data processing, statistics, and storage, e.g. arrayanalysis.org; ontologies; modeling & data integration, network biology (extension), supervised statistics; curation and simulation; annotation & provenance; research applications; mapping via BridgeDb; extraction, SPARQLing, conversion.)
5. We can do things like this (diabetic liver)
Pihlajamäki et al. dataset is from Gene Expression Omnibus, GEO:GSE15653. Pihlajamäki et al., J Clin Endocrinol Metab. 2009, 94(9):3521-3529. DOI: 10.1210/jc.2009-0212. Martina Kutmon et al., BMC Genomics 2014, 15:971. DOI: 10.1186/1471-2164-15-971.
8. How do pharma companies use public data? (Examples: Pfizer, AZ, Roche.)
10. (Open PHACTS platform architecture diagram: Nanopub, Db, and VoID data sources feed a Data Cache (Virtuoso Triple Store), a Semantic Workflow Engine, and a Linked Data API (RDF/XML, TTL, JSON). Surrounding components: Domain Specific Services; Identity Resolution Service; Chemistry Registration, Normalisation & Q/C; Identifier Management Service; Indexing; Core Platform with example identifiers P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”. Content tiers: Public Content, Commercial, Public Ontologies, User Annotations, Apps.)
11. (Repeat of the slide 10 architecture diagram.)
From: https://xkcd.com/927/
This is basically why we in COST CHARME and in the ELIXIR interoperability platform work on the glue between standards
Animated slide
Showing data and knowledge resources on the left (you can use FAIRsharing to find these). Results are mined from these, combined (where the gluing occurs), and used for AI. This talk focuses on the combining aspects. If you do that correctly, the AI part later on will not have to make the connections and the power can be used to obtain other results
This slide shows the overall approach and the position of the different components/projects
This study shows we can actually find useful data (in this case liver transcriptomics from human diabetic patients compared to non-diabetic patient data), process the data, perform pathway enrichment analysis (map to the entities in the pathways), combine pathways into a network (combining overlapping pathways, which in the case of WikiPathways need to be mapped again), extend that network with transcription factors (from e.g. ENCODE; again, targets need to be mapped), look for active nodes in the networks, and find the transcription factors that affect these active nodes (essentially turning the network inside out). The result shows the main known transcription factors in diabetics, using information from just one study rather than the wealth of information that made them known.
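Pathway enrichment of the kind described here is often done as an over-representation test; the following minimal Python sketch (our illustration, not the exact pipeline used in the study) applies a hypergeometric test with made-up gene counts:

```python
from scipy.stats import hypergeom

# Illustrative numbers only: a universe of measured genes, one pathway,
# and a set of differentially expressed (DE) genes.
N = 20000   # genes in the universe (e.g. all measured transcripts)
K = 150     # genes annotated to the pathway
n = 800     # DE genes in the study
k = 25      # DE genes that fall in the pathway

# P(X >= k): probability of seeing at least k pathway genes among the
# DE set by chance; small values indicate enrichment.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.2e}")
```

In practice this is repeated per pathway with multiple-testing correction; here the expected overlap by chance is only about 6 genes, so 25 is strongly enriched.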
A little bit more illustration of how that was done, showing the pathways affected, the overlapping entities, and a small representation of the resulting network
How pharma reused public data
And every company does the same
So for the Open PHACTS project we had the idea to link the relevant data together, using data from ChEMBL (compound-target), NextProt (on the targets), WikiPathways and Reactome (processes and pathways these targets are involved in), and DisGeNET (linking the genes coding for these targets to diseases), all using a semantic web approach that would make it one big linked dataset
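As a flavor of what "one big linked dataset" means in practice, here is a minimal RDF sketch in Python with rdflib; the URIs, identifiers, and predicate names are invented placeholders, not the actual Open PHACTS vocabulary:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Illustrative namespace only; the real platform uses curated vocabularies.
EX = Namespace("http://example.org/")

g = Graph()
compound = URIRef(EX["chembl/CHEMBL25"])   # a compound (placeholder URI)
target   = URIRef(EX["protein/P12374"])    # a protein target
pathway  = URIRef(EX["pathway/WP1234"])    # a pathway (placeholder ID)
disease  = URIRef(EX["disease/diabetes"])  # a disease concept

# compound -> target -> pathway -> disease, mirroring
# ChEMBL / NextProt / WikiPathways / DisGeNET style links.
g.add((compound, EX.interactsWith, target))
g.add((target, EX.participatesIn, pathway))
g.add((pathway, EX.associatedWith, disease))
g.add((target, EX.label, Literal("Adenosine receptor 2a")))

# One SPARQL query now traverses what were three separate sources.
q = """
SELECT ?compound ?disease WHERE {
  ?compound <http://example.org/interactsWith> ?t .
  ?t <http://example.org/participatesIn> ?p .
  ?p <http://example.org/associatedWith> ?disease .
}"""
for row in g.query(q):
    print(row.compound, "->", row.disease)
```

The point of the semantic web approach is exactly this: once the triples share identifiers, a single query spans all the underlying databases.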
Apart from the fact that such data is not automatically linked even if you describe it well
So we added resources that map between textual concepts, ontology terms, database IDs and chemical structures
Some of such mapping tools are now part of ELIXIR’s recommended interoperability services
E.g. BridgeDb, able to map gene and gene product database IDs, metabolite IDs, and more. A BridgeDb-based identifier mapping service was part of the original Open PHACTS
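A quick way to try such identifier mapping is BridgeDb's public REST service; the sketch below assumes the webservice.bridgedb.org endpoint and its xrefs path as commonly documented, so treat the exact URL pattern and response format as assumptions to verify:

```python
import requests

# Assumed BridgeDb REST pattern: /{organism}/xrefs/{systemCode}/{id}
# "L" is the BridgeDb system code for Entrez Gene; verify both against
# the current BridgeDb documentation before relying on this.
BASE = "https://webservice.bridgedb.org"
organism, system_code, identifier = "Human", "L", "1234"

resp = requests.get(f"{BASE}/{organism}/xrefs/{system_code}/{identifier}",
                    timeout=30)
resp.raise_for_status()

# Response is assumed tab-separated: mapped identifier <TAB> datasource.
for line in resp.text.strip().splitlines():
    mapped_id, datasource = line.split("\t")
    print(datasource, mapped_id)
```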
The ontology mapping and cross-reference service OxO at EBI, which has not yet been deployed to work on such tasks, but offers the potential to do so
While the CDK could be used to develop a service that can resolve substructures, replacing the original service that was "not the best open source" project
And we need classic bioinformatics approaches to map between similar services, domains affected, and predicted functions of variants, and to map SNPs to indels and such
And if we get all that done, we might be able to reuse data the way rocket scientists reuse their actual rockets (showing the landing of the two side boosters of the very first Falcon Heavy, the first time two such boosters made it back to Earth at the same time, potentially able to be reused)