This document summarizes a presentation about the EnVisioning Pathways project. It discusses:
1) The EnCORE integration platform developed by the ENFIN Network of Excellence to enable mining data across different biological domains, sources, formats and types through a standardized XML format and web services.
2) Examples of EnCORE services that retrieve interaction data from databases like IntAct and Pride, map between identifiers, and represent results in biological pathways in databases like Reactome.
3) Efforts to adapt EnCORE to utilize standards and create a federated system to integrate information from different biological domains. This includes building predefined and user-selected workflows between EnCORE services.
Accessing small molecule data using ChEBI, by Duncan Hull
This document summarizes a presentation about accessing chemical data using the ChEBI database. It introduces ChEBI as a manually annotated database and ontology of small chemical entities. It covers searching and browsing ChEBI, understanding the ChEBI ontology structure, and methods for programmatic access including downloads of the ChEBI data in different file formats and via a web service API.
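The programmatic-access route described above boils down to fetching an entity record and parsing the returned XML. The Python sketch below illustrates that general pattern; the XML snippet and its element names are simplified assumptions for illustration, not the exact ChEBI web service schema.

```python
import xml.etree.ElementTree as ET

# Illustrative record in the spirit of a ChEBI entity response;
# element names here are assumptions, not the verbatim ChEBI payload.
SAMPLE = """<entity>
  <chebiId>CHEBI:15377</chebiId>
  <chebiAsciiName>water</chebiAsciiName>
  <formula>H2O</formula>
</entity>"""

def parse_entity(xml_text):
    """Extract identifier, name and formula from an entity record."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.findtext("chebiId"),
        "name": root.findtext("chebiAsciiName"),
        "formula": root.findtext("formula"),
    }

entity = parse_entity(SAMPLE)
print(entity["id"], entity["name"], entity["formula"])
```

In a real client the `SAMPLE` string would be replaced by the body of an HTTP response from the ChEBI web service; the parsing step stays the same.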
This document discusses several data integration tools: DAS, PSICQUIC, EnFIN, EnCORE, and Biomart. DAS is a distributed annotation system that allows uniform access to biological data from multiple repositories. PSICQUIC integrates molecular interaction data based on the PSI-MI standard. EnFIN, EnCORE, and EnVISION provide data integration across various domains, sources, formats and types by standardizing data in an EnXML format and developing web services. Biomart allows federated querying of biological data across different databases through a common query interface.
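PSICQUIC's value comes from the shared REST convention: a MIQL query is URL-encoded into the request path, so the same client code works against any compliant service. A minimal sketch (the base URL below is a placeholder, not a specific live service):

```python
from urllib.parse import quote

def psicquic_query_url(base_url, miql):
    """Build a PSICQUIC-style REST query URL from a MIQL expression.
    The /search/query/ path follows the common PSICQUIC REST pattern;
    the base URL is supplied by whichever service you target."""
    return base_url.rstrip("/") + "/search/query/" + quote(miql, safe="")

# Placeholder host; any PSICQUIC-compliant service takes the same query.
url = psicquic_query_url(
    "https://example.org/psicquic/webservices/current",
    'identifier:P12345 AND species:"9606"',
)
print(url)
```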
Data integration in Proteomics through EnVision and EnCore Web Services, by Rafael C. Jimenez
The document discusses EnCore, a platform developed by the ENFIN Network of Excellence to enable data integration across various biological domains and data sources. EnCore provides a set of web services and the EnVision interface to allow users to query multiple databases and analyze the results in a standardized format. It is moving towards a new approach based on standards and federation to access more external sources in an interoperable way and to add value to the original data.
This document provides a tutorial for using EnCORE, a tool that allows biologists to analyze biological data and receive outputs from multiple databases and web services. It describes the EnCORE interface and how to perform searches, view and analyze results from tools like PICR, Pride, Reactome, IntAct, CellMint and BioModels. The tutorial explains how to create new queries, select input types, submit jobs, view results overview pages and dataset logs, download XML files, and manage saved datasets. It also demonstrates how to combine datasets and view combined results.
This document provides an overview of data integration in biology, including why it is needed, common problems, and popular approaches. It discusses the many different biological data sources and standards that have been developed for integration. Different architectures for data integration are described, including data warehousing, federation, and view integration. Key variables that affect integration like scope, domain, and interfaces are outlined. Important standards, ontologies, guidelines and tools that support integration are also reviewed.
The document summarizes a presentation given by Rafael Jimenez on March 25, 2010 about EnCORE and EnVision. EnCORE is a platform developed by the ENFIN Network of Excellence to enable mining data across different sources and formats by integrating databases and analysis tools. It uses a standardized EnXML format. EnVision is a user interface that allows querying multiple data sources simultaneously and visualizing the results. The presentation discusses how EnCORE and EnVision can adopt standards and enable federation to improve data integration and provide additional value to source data.
This document discusses data integration in bioinformatics. It begins by explaining why data integration is needed due to the large number of specialized databases and diversity of data types. It then defines data integration as combining data from different sources into a unified view. Some of the challenges of data integration mentioned include different data schemas, interfaces and vocabularies between databases. Several common approaches to data integration are described, including data centralization, federated databases and view integration. Important variables that affect integration approaches are also outlined, such as the domain, architecture and query interface. Finally, some examples of commonly used tools for tasks like workflow management, web services and format standards are provided.
This document provides an overview of data integration and uses the Distributed Annotation System (DAS) as an example. It discusses how DAS allows for the combining of biological sequence data from different sources through a standardized protocol. It then describes how to use a card game to intuitively teach students about key DAS concepts like data distribution, service-oriented architecture, and standardization. The game has players take on roles within the DAS system like client, source, and registry to reinforce these concepts through collaborative play.
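The DAS concepts taught by the card game map directly onto the protocol itself: a client sends a plain HTTP request to a source and gets XML back. A hedged sketch of building a `features` request and parsing a response; the server name and the response fragment are illustrative, not a real DAS payload:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def das_features_url(server, source, segment):
    """Build a DAS 'features' request URL (server name is a placeholder)."""
    query = urlencode({"segment": segment})
    return f"{server.rstrip('/')}/das/{source}/features?{query}"

# Illustrative DASGFF-style response fragment; attributes are
# simplified assumptions, not a verbatim server payload.
SAMPLE = """<DASGFF><GFF><SEGMENT id="chr1" start="100" stop="200">
  <FEATURE id="f1" label="exon"/>
  <FEATURE id="f2" label="repeat"/>
</SEGMENT></GFF></DASGFF>"""

def feature_ids(xml_text):
    """Collect the ids of all FEATURE elements in a response."""
    root = ET.fromstring(xml_text)
    return [f.get("id") for f in root.iter("FEATURE")]

print(das_features_url("https://example.org", "hg18", "chr1:100,200"))
print(feature_ids(SAMPLE))
```

Because every source answers the same request shape, a DAS client can aggregate annotations from many servers with no per-source code, which is exactly the standardization point the card game is built around.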
Presentation on pathway extensions using knowledge integration and network approaches, given at the Systems Biology Institute in Luxembourg on November 28, 2012.
The European Bioinformatics Institute (EBI) is a center for bioinformatics research and services located in Hinxton, UK. EBI grew out of EMBL's work providing public biological databases and offers major databases on DNA, RNA, proteins, pathways, and more. EBI's website provides access to these databases as well as a variety of bioinformatics tools for sequence analysis, proteomics, microarrays, and more through different channels on their site.
The document discusses bioinformatics tools used for analyzing biological data. It begins with an introduction to bioinformatics and then describes several categories of tools: biological databases for storing genomic and protein data; homology tools for sequence alignment and comparison; protein function analysis tools; structural analysis tools; and sequence manipulation and analysis tools. Common tools discussed include BLAST, FASTA, ClustalW, and databases like GenBank. The document concludes by covering applications of bioinformatics in areas like molecular modeling, medicine, and computation.
Proteomics repositories integration using EUDAT resources, by Rafael C. Jimenez
This document discusses plans to integrate proteomics data repositories using resources from the EUDAT data infrastructure. It describes replicating data from the ELIXIR repository PRIDE to EUDAT data centers for backup and access. This will test using EUDAT services like B2SAFE for replication and assigning persistent identifiers (PIDs) to datasets and files. The current status describes installing necessary software at participating sites and initial testing of replication from PRIDE to the Swedish National Bioinformatics Infrastructure data center. Future plans include syncing data changes and exploring data push/pull models between repositories.
Presentation for NetBio SIG 2013 by Robin Haw, Scientific Associate and Outreach Coordinator, Ontario Institute for Cancer Research: “Reactome Knowledgebase and Functional Interaction (FI) Cytoscape Plugin”.
GiTools is a software tool for analyzing and visualizing genomic data. It allows users to analyze data, visualize results, and integrate new analysis features over time. The tool has been used in case studies such as analyzing the binding targets of the RBP2 protein during cell differentiation. Future work will improve the tool's statistical tests, integration with other bioinformatics software, and user experience. The GiTools team continues to develop and validate the software.
Analysis and visualization of microarray experiment data integrating Pipeline..., by Vladimir Morozov
This document summarizes the analysis and visualization of microarray experiment data using Pipeline Pilot, Spotfire and R. Key points:
- More than 30 public and proprietary microarray experiments were analyzed using in-house software workflows in Pipeline Pilot.
- Pipeline Pilot workflows retrieve gene annotation from NCBI and produce visualizations of differential expression statistics and biological pathway regulation in Spotfire.
- The gene expression values are analyzed via custom R scripts and plotted using the R connector. Results are integrated into the company's knowledge platform.
An Overview of the iMicrobe Project and available tools in the iPlant Cyberinfrastructure. This talk was given at a workshop at ASLO in Granada, Spain focused on applications in Oceanography and Limnology.
Exploiting technical replicate variance in omics data analysis (RepExplore), by Enrico Glaab
High-throughput omics datasets often contain technical replicates included to account for technical sources of noise in the measurement process. Although summarizing these replicate measurements by using robust averages may help to reduce the influence of noise on downstream data analysis, the information on the variance across the replicate measurements is lost in the averaging process and therefore typically disregarded in subsequent statistical analyses.
We introduce RepExplore, a web service dedicated to exploiting the information captured in technical replicate variance to provide more reliable and informative differential expression and abundance statistics for omics datasets. The software builds on previously published statistical methods, which have been applied successfully to biomedical omics data but are difficult to use without prior experience in programming or scripting. RepExplore facilitates the analysis by providing fully automated data processing and interactive ranking tables, whisker plots, heat maps and principal component analysis visualizations to interpret omics data and derived statistics.
Availability and implementation: Freely available at http://www.repexplore.tk
Journal publication: Glaab, E., & Schneider, R. (2015). RepExplore: addressing technical replicate variance in proteomics and metabolomics data analysis. Bioinformatics, 31(13), 2235-2237. http://bioinformatics.oxfordjournals.org/content/31/13/2235.long
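The core idea above, carrying replicate variance forward instead of discarding it in an average, can be illustrated with plain inverse-variance weighting. This is a generic sketch of the principle, not RepExplore's published method:

```python
def inverse_variance_mean(means, variances):
    """Combine per-sample replicate means, weighting each by the
    reciprocal of its technical-replicate variance, so that noisier
    measurements contribute less to the summary value."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(m * w for m, w in zip(means, weights)) / total

# Invented example: two samples agree (10.0), one noisy sample
# disagrees (16.0) and also has the largest replicate variance.
means = [10.0, 10.0, 16.0]
variances = [0.5, 0.5, 4.0]
print(inverse_variance_mean(means, variances))
```

A plain average of these values would be 12.0; the variance-aware summary sits much closer to the two low-noise measurements, which is the behaviour the abstract argues is lost when replicates are simply averaged and the variance discarded.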
Event: Plant and Animal Genomes conference 2012
Speaker: Sandra Orchard
InterPro is an open-source protein resource used for the automatic annotation of proteins, and is scalable to the analysis of entire new genomes through the use of a downloadable version of InterProScan, which can be incorporated into an existing local pipeline. InterPro integrates protein signatures from 11 major signature databases (CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, and TIGRFAMs) into a single resource, taking advantage of the different areas of specialization of each to produce a resource that provides protein classification on multiple levels: protein families, structural superfamilies and functionally close subfamilies, as well as functional domains, repeats and important sites. The InterPro website has been improved, following extensive community consultation and a new version of InterProScan promises improved speed, ease of implementation as well as additional functionalities.
NGS Management And Analysis: From Sample To Molecular And Network Biology, by Arnaud Céol
The Genomic Unit of the Center for Genomic Science of IIT@SEMM processes thousands of samples on Next Generation Sequencing platforms. We will briefly present how we manage the experimental flow and data with our dedicated LIMS and facilitate primary and secondary analyses with HTS-flow, a workflow management system that has been standardized and made easily accessible to both dry and wet lab scientists. Finally, we will show how we are extending genome visualization tools to enable the integration of NGS data with molecular, network and structural biology.
Presented at the "Giornata Milanese di NGS", 2016 Apr 8, University of Milan Bicocca
Using biological network approaches for dynamic extension of micronutrient re..., by Chris Evelo
This document discusses using biological network approaches to dynamically extend pathways with regulatory information such as microRNAs (miRNAs). It describes tools like PathVisio that can integrate gene expression, proteomics and metabolomics data onto pathways to identify significantly changed processes. WikiPathways is introduced as a public pathway resource that can be contributed to and curated by researchers. The document outlines approaches for visualizing regulatory interactions on pathways using plugins, exploring pathway interactions through network analysis, and integrating other data types such as SNPs, fluxes and gene annotations to build a more comprehensive understanding of biological systems.
Metabolic pathway mapping against KEGG, Reactome, HMDB and CPDB, by Dinesh Barupal
This document describes various approaches for mapping detected metabolites to metabolic pathways using online databases and tools. It discusses obtaining KEGG identifiers for metabolites and using KEGG, Reactome, MetaboAnalyst and ConsensusPathDB to map those identifiers to pathways and to visualize pathways with overlays of the mapped metabolites. It notes that some metabolites may lack identifiers or may not map to any pathway, and emphasizes that enrichment analysis can account for more of the identified compounds than appear on the pathway maps themselves.
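The enrichment analysis mentioned above typically amounts to a hypergeometric over-representation test: does a pathway contain more of the detected metabolites than chance predicts? A self-contained sketch with invented numbers:

```python
from math import comb

def hypergeom_pvalue(total, in_pathway, detected, overlap):
    """P(X >= overlap) when drawing `detected` metabolites without
    replacement from a background of `total`, of which `in_pathway`
    belong to the pathway being tested."""
    denom = comb(total, detected)
    p = 0.0
    for k in range(overlap, min(in_pathway, detected) + 1):
        p += comb(in_pathway, k) * comb(total - in_pathway, detected - k) / denom
    return p

# Invented example: 1000 background metabolites, 40 on a pathway,
# 50 detected in the experiment, 8 of which fall on the pathway
# (versus about 2 expected by chance).
print(hypergeom_pvalue(1000, 40, 50, 8))
```

A small p-value here says the pathway is over-represented among the detected compounds even if only a few of them are drawn on the pathway map itself.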
Ondex is a data integration and visualization platform used to integrate large amounts of biological data from multiple sources. It transforms the data into a graph of biological concepts and relationships. Ondex allows users to integrate data, perform semantic alignment of concepts, and visualize the integrated network. Filters and annotators can then be used to highlight specific areas of interest within the large integrated network. Ondex has been applied to problems such as candidate gene prioritization, pathway mapping, and analysis of quantitative trait loci regions in plants.
OVium Bio-Information Solutions uses forefront algorithms to analyze key data resources such as NCBI, EMBL and PDB to develop cell signalling pathways.
OVium employs cloud and MPP computing solutions with homology and signal network mapping to develop chemical and protein pathways for discovery research.
Introduction to Cytoscape talk given in March 2010 at the CRUK CRI, Cambridge, UK.
It was designed to give a broad introduction to the features available in Cytoscape for wet lab researchers.
This document provides an overview of downstream analyses that can be performed after variant identification and filtering in a typical variant calling pipeline. It discusses visualization of variant data in each gene to identify potential causative variants. It also mentions association studies as another type of downstream analysis where variants are tested for association with disease phenotypes. The goal of downstream analyses is to help prioritize variants for further investigation.
The document discusses the development of computational analysis tools for natural products research and metabolomics. It introduces NetPathMiner, a software tool for network path mining through gene expression data. NetPathMiner allows mining of active pathways from biological networks, handles different network formats and representations, and provides visualization of pathways and networks. It also introduces NMRPro, a tool for interactive online processing of NMR spectra, which aims to address current limitations in NMR spectral processing and sharing.
The document summarizes a workshop aimed at integrating resources between several bioinformatics standards registries. The workshop agenda includes presentations from Identifiers.org, BioSharing, BMB Service Registry, EDAM ontology, and the BMB standards registry. Breakout sessions will identify overlaps and potential synergies between the registries, and define areas for collaboration. The goal is to reduce duplication of efforts and develop a common integration and development strategy across registries.
ELIXIR aims to establish a pan-European infrastructure for biological information to support life sciences research. It will do this by coordinating nodes that provide services and resources, establishing standards, and closing skills gaps. Key challenges include sustaining data and services, ensuring interoperability, and dealing with increasingly large datasets. ELIXIR is working on pilots and task forces to address issues like cloud computing, storage, authentication and authorization.
ELIXIR is a European research infrastructure for biological information that aims to support life science research. It brings together major bioinformatics providers and is supported by 17 EU member states. ELIXIR works to safeguard biological data and build sustainable data services. It establishes a distributed infrastructure to handle the large growth of data and provides tools, services, and platforms to facilitate access and analysis of data. ELIXIR also develops standards and provides training to support computational biology. Key activities include establishing national nodes, technical task forces, and pilot projects in areas like cloud resources, data transfer, and linking distributed databases.
The document summarizes discussions from the Technical Coordinator Group (TCG) meeting. The TCG is an advisory body to the Heads of Nodes Committee and consists of technical experts from each ELIXIR Node. They discuss technical and scientific aspects of ELIXIR and identify best practices. The summary outlines the members of TCG, describes short term working groups led by technical coordinators on specific technical efforts, and provides updates on various ELIXIR task forces focusing on areas such as cloud, storage, authentication and authorization, service registry, and training.
- ELIXIR is a European research infrastructure for biological information that aims to facilitate life sciences research. It brings together over 100 bioinformatics service providers from 17 EU member states.
- The large increase in biological data from sources like DNA sequencing and mass spectrometry is outpacing storage capabilities and transfer speeds. This "data deluge" threatens to overwhelm existing infrastructure for data sharing and analysis in life sciences.
- Cloud computing provides potential solutions like more storage, data compression, keeping data close to computation, and provisioning researchers directly with storage and tools. ELIXIR and Google Cloud Platform UK discussed collaborating to host processed data, provide joint solutions for large data producers, and leverage Google Cloud capabilities to help
The European life-science data infrastructure: Data, Computing and Services ...Rafael C. Jimenez
The document provides an update on the European Life Sciences Infrastructure for Biological Information (ELIXIR). ELIXIR aims to establish a distributed infrastructure to handle the growing volume of life science data. It coordinates several national nodes that provide bioinformatics resources and services. Key recent activities include establishing legal agreements with member states, developing a technical coordinator network, and running pilot projects to test solutions and foster collaboration between nodes. Moving forward, priorities include further establishing the infrastructure and community, providing visible and useful services to users, and ensuring sustainable data management.
ELIXIR is a European research infrastructure for biological information that aims to facilitate life sciences research across Europe. It brings together life science resources from member countries to build a robust infrastructure for biological data. Individual organizations or countries cannot achieve this alone. ELIXIR establishes national nodes that leverage local strengths and priorities to deliver shared services through a distributed network. This allows the infrastructure to scale effectively with increasing data challenges. ELIXIR also works to improve data integration and interoperability across distributed resources through activities like developing standards and linking related communities.
The document discusses ELIXIR, the European Life Sciences Infrastructure for Biological Information. It provides information on ELIXIR's governance structure, member countries, and nodes. The nodes work with the central ELIXIR hub to develop and deliver bioinformatics services, resources, training, and more. The goal is to support life science research through integrated, interoperable data and tools.
This document discusses standards and data integration in life sciences databases. It notes that there are many diverse and dispersed databases in molecular biology. Standards facilitate data sharing, integration, and reuse by defining common data representation and description formats. However, integrating data across different sources is challenging due to variables like different interfaces, data types, and levels of information. Initiatives like ELIXIR and HUPO PSI aim to improve interoperability between life sciences resources through defining community standards and best practices.
The European Life Sciences Infrastructure for Biological Information (ELIXIR) coordinates biological data resources across Europe. It has several task forces working on key issues. This document summarizes discussions from the Technical Coordinators Group (TCG) meeting and provides updates from 7 task forces: Cloud, Storage, Authentication and Authorization (AAI), Service Registry, Metrics and Monitoring, Communication, and Website. Each section briefly describes the task force's goals, current work, and plans to coordinate with other groups to develop technical strategies for ELIXIR.
This document provides an introduction to programmatic access and web services for querying biological data resources. It discusses different types of query interfaces including graphical user interfaces, application programming interfaces, and web services. It then focuses on describing web services, including REST and SOAP web services. Examples are given of using PSICQUIC REST and SOAP services to query molecular interaction data. The document also introduces workflows and workflow management systems like Taverna and myExperiment that allow sharing and reusing workflows that combine multiple services.
Life science requirements from e-infrastructure:initial results from a joint...Rafael C. Jimenez
This document summarizes a workshop on life science requirements from e-infrastructure held by BioMedBridges. It discusses how big data is affecting challenges like data growth outpacing storage and transfer speeds. Potential solutions proposed include improving storage, compression, networking, partitioning data, and computing approaches like clouds. The workshop concluded that e-infrastructures need to better understand research infrastructure problems, evaluate bottlenecks, discuss solutions, and define requirements as big data will change current approaches to data sharing and management.
ELIXIR is a European research infrastructure that aims to facilitate life sciences research by coordinating the development of sustainable bioinformatics services and tools across Europe. It is working on several pilot projects to improve data integration, access to cloud computing resources, and authentication and authorization processes for sensitive data. The document discusses ELIXIR's goals and various technical work streams and task forces focused on developing strategies and standards to address challenges in integrating distributed biological data resources.
Data submissions and archiving raw data in life sciences. A pilot with Proteo...Rafael C. Jimenez
European Life Sciences Infrastructure for Biological Information aims to provide data infrastructure for biological information sharing. It is running a pilot project with proteomics data to enable standardized submission and dissemination of data between major proteomics resources like PRIDE and PeptideAtlas. The pilot allows direct archiving of raw proteomics data in PRIDE for the first time. It uses the EUDAT program for data storage and access and ProteomeXchange as a framework to link proteomics databases together. The goal is to prepare for the rapid growth of life sciences data and keep up with processing and storing the large volumes of raw data being generated.
This document discusses challenges in life sciences data management and services provided by ELIXIR to address these challenges. ELIXIR aims to facilitate life sciences research by building a sustainable infrastructure for biological data in Europe. It coordinates several nodes across member states that provide specialized data services. ELIXIR is also running pilot projects to test integration of services, including providing cloud access to reference data and distributed authentication and access to clinical archives. Future challenges include sustaining funding and scaling to handle exponentially growing data volumes.
SASI, A lightweight standard for exchanging course informationRafael C. Jimenez
The document describes SASI (Scientific Announcement Standards Initiative), a lightweight standard for exchanging course information between life science organizations. It discusses problems with the current redundant and inconsistent annotation and distribution of announcements. The proposed solution is a centralized registry for annotation using agreed-upon standards, with decentralized distribution of announcements. This would allow automatic exchange of standardized announcements to reduce effort and improve discoverability.
This document provides an overview of the European Life Sciences Infrastructure for Biological Information (ELIXIR). ELIXIR aims to build a sustainable infrastructure for biological data by coordinating existing life science resources across Europe. It will provide services for data, tools, computing, standards, and training. ELIXIR is establishing pilot projects to test integration of these services and address challenges like data access and scale. A Technical Coordination Group leads implementation and coordinates task forces to develop each area of the infrastructure. ELIXIR also partners with e-infrastructures to utilize high-performance computing and networking resources. Its goal is to support life science research and translation to areas like medicine, the environment, and bioindustry.
2. Molecular Biology Database resources
Breakdown of the ~1440 resources by category:
• Genomics Databases, non-vertebrate: 19%
• Human Genes and Diseases: 13%
• Protein sequence databases: 13%
• Nucleotide Sequence Databases: 9%
• Structure Databases: 9%
• Metabolic and Signaling Pathways: 9%
• Human and other Vertebrate Genomes: 8%
• Plant databases: 7%
• RNA sequence databases: 5%
• Other Molecular Biology Databases: 3%
• Immunological databases: 2%
• Organelle databases: 2%
• Proteomics Resources: 1%
Source: Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. MY Galperin, GR Cochrane - Nucleic Acids Research, 2008.
3. Molecular Biology Database resources
• Metabolic and Signaling Pathways (~122 resources):
– Protein-protein Interactions: 62%
– Metabolic pathways: 21%
– Enzymes and enzyme nomenclature: 12%
– Signaling pathways: 5%
Source: Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. MY Galperin, GR Cochrane - Nucleic Acids Research.
4. Biological pathway resources
Pathguide (http://www.pathguide.org), ~303 resources by category:
• Protein-Protein Interactions: 34%
• Metabolic Pathways: 20%
• Transcription Factors / Gene Regulatory Networks: 15%
• Protein-Compound Interactions: 11%
• Pathway Diagrams: 10%
• Protein Sequence Focused: 6%
• Other: 4%
5. Centralized databases VS In-house databases
[Diagram] Centralized database: many annotators (A) feed one database (DB), exposed through a single graphical user interface (GUI), application programming interface (API) and set of web services (WS). In-house databases: each annotator maintains its own DB, each with its own GUI, API and WS.
Legend: A = annotator, DB = database, GUI = graphical user interface, API = application programming interface, WS = web services, SP = standard protocol.
7. Many databases VS Federation
[Diagram] Many databases: the user connects separately to each database through its own GUI, API and WS. Federation: each database additionally exposes a standard protocol (SP), so a single client can query all of them uniformly.
9. Data integration
• Combining data residing in different sources …
• … providing users with a unified view of these data.
Main objective:
• Share, compare and unify
– Data from the same domain
– Data from different domains
Requires:
• Federated systems
• Standard formats
• Mapping tools
• Ontologies
10. Data integration
• Federated systems
– DAS
– PSICQUIC
– …
• Standard formats
– DAS
– PSI-MI
– BioPAX
– SBML
– CellML
– …
• Ontologies
– OLS
– …
• Mapping tools
– PICR
– Uniprot API
– Ensembl API
– DAS
– Biomart
– …
• Integration systems
– Biomart
– EnCORE
– …
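Mapping tools such as PICR exist to translate between identifier spaces so that results from different sources can be joined. A toy illustration of that translation step (the mapping table below is invented, standing in for the kind of lookup a service like Enfin-Affy2UniProt performs):

```python
# Invented mapping table: Affymetrix probe set IDs to UniProt accessions.
# Real mapping services hold millions of such cross-references.
affy_to_uniprot = {
    "203085_s_at": "P01137",
    "220406_at": "P37173",
}

def map_ids(ids, table):
    """Translate identifiers into a common space, keeping unmapped ones visible."""
    mapped = {i: table[i] for i in ids if i in table}
    unmapped = [i for i in ids if i not in table]
    return mapped, unmapped

mapped, unmapped = map_ids(["220406_at", "999999_at"], affy_to_uniprot)
```

Keeping the unmapped identifiers separate matters in practice: silently dropping them is a common source of apparent "missing" results after integration.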
11. Standards development – international collaborations
• Genome annotation: www.geneontology.org
• Microarray and Gene Expression Data (MGED): www.mged.org
• Protein sequence: www.uniprot.org
• HUPO Proteomics Standards Initiative (PSI): psidev.sf.net
• Protein structure: www.wwpdb.org
• Cheminformatics: www.ebi.ac.uk/chebi
• Pathways: www.reactome.org, www.biopax.org
• Systems modelling standards: www.sbml.org
• Metabolomics Standards Initiative (MSI): www.metabolomicssociety.org
• Genomics Standards Consortium (GSC): gensc.org
• Nucleotide sequence: www.insdc.org
12. The Distributed Annotation System
Dowell et al., BMC Bioinformatics. 2001; 2: 7. Published online 2001 October 10.
DAS, Architectural Overview (illustration)
14. DAS servers and data types
Reference, annotation and alignment servers together cover data types including: genome sequence, protein sequence, sequence alignments, protein structure, protein-protein interaction, 2D gels, mass spectrometry, EMAP, 3DM, epigenetics, phenotype, functional genomics and structural genomics.
19. ENFIN Network of Excellence
• Brings together experimentalists and computational biologists to develop the next generation of informatics resources for systems biology
• Funded by the European Commission within its FP6 programme under the thematic area ‘Life sciences, genomics and biotechnology for health’
• 20 partners in 13 countries
• www.enfin.org
21. Diverse service world
External data sources come in different formats (XML, CSV, plain text, JSON, …) behind different access interfaces (SOAP, REST, Java API, Perl API, FTP, GUI, …). For the user, integration means:
• Multiple manual connections
• Multiple technologies
• Multiple result files which have to be combined manually
• Much work to reproduce
22. Standardised EnCORE world
[Diagram] The heterogeneous external world (external data sources) is wrapped by EnCORE services, whose input and output use the standard EnXML format. Users reach this standardised EnCORE world through the EnVISION pages or API/WS access.
24. EnCORE services
From inputs to outputs: an EnCORE web service takes an input/query as an EnCORE dataset and returns output/results, split into positive and negative, in the same format.
EnCORE web services:
• Enfin-IntAct
• Enfin-PRIDE
• Enfin-Affy2UniProt
• Enfin-PICR
• Enfin-Reactome
• Enfin-ArrayExpress
• Enfin-UniProt
• Enfin-BioModels
• Enfin-KEGG
• Enfin-G:GOSt
• Enfin-CellMINT
• Enfin-DOMAINATION
Inputs: database IDs or sequences.
Result structure:
• Experiment: identifies the result
• Sets: contain the structure of the result
• Molecules: include the results
• Features: describe details of the result
25. EnCORE services
Example: querying the Enfin-IntAct web service with a database ID (UniProt ID) P37173 returns:
• Experiment: ID4
• Sets: (1) EBI-296235, (2) EBI-1033040, (3) EBI-902913, EBI-902937, (4) EBI-296166, EBI-296246, (5) EBI-902913
• Molecules: (1) O35613, (2) P10600, (3) P07200, (4) Q9UER7, (5) Q99K41
• Features: no features
26. EnCORE services
Example (result as a table), for the query P37173 through Enfin-IntAct:

    Interactor A   Interactor B   Interaction IDs
 1  P37173         O35613         EBI-296235
 2  P37173         P10600         EBI-1033040
 3  P37173         P07200         EBI-902913, EBI-902937
 4  P37173         Q9UER7         EBI-296166, EBI-296246
 5  P37173         Q99K41         EBI-902913
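The reshaping from sets and molecules into such a table is mechanical. A minimal Python sketch, where a plain dictionary stands in for the EnXML result (the layout is illustrative, not the actual EnXML schema):

```python
# Illustrative stand-in for the Enfin-IntAct result for query P37173:
# each interaction partner (molecule) maps to its set of interaction IDs.
query = "P37173"
interaction_sets = {
    "O35613": ["EBI-296235"],
    "P10600": ["EBI-1033040"],
    "P07200": ["EBI-902913", "EBI-902937"],
    "Q9UER7": ["EBI-296166", "EBI-296246"],
    "Q99K41": ["EBI-902913"],
}

# One table row per set: interactor A, interactor B, interaction IDs.
rows = [(query, partner, ", ".join(ids))
        for partner, ids in interaction_sets.items()]
```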
33. Adapting EnCORE to Standards and Federation
[Diagram] External data sources per domain (Domain 1, …) are accessed through federated systems and standards; an EnCORE wrapper exposes them as web services (WS) behind the EnVISION pages and web interface.
35. Adapting EnCORE to Standards and Federation
• Integration of sources across domains (Domain 1, Domain 2, Domain 3, Domain 4, Domain 5, …)
• Filtering redundancy (whenever possible)
• Interconnecting results
36. Predefined workflows (EnVISION, EnVISION2)
• Run different services on the same input
• Use the output of one service as an input of another service
37. Predefined workflows and automated workflows
• The “Semantic Web” promises to use data sources and analysis tools to automatically build workflows that make sense to satisfy users’ requests.
• The “Semantic Web” is still at an early stage, so it is not a practical solution to apply to our workflows.
• Useful workflows require users to go through each step of the workflow.
• Our problems using predefined workflows:
– Explosion of results.
– Workflow configuration is subjective.
– We could come up with multiple predefined combinations.
– Limitations in defining their configuration.
38. User selection based workflow
[Diagram] (1) A query runs through an EnCORE WS, producing positive and negative results. (2) The user selects a subset of those results. (3) The selection becomes the query for the next EnCORE WS, and so on.
39. Biological pathway resources
Pathguide: data access methods offered by the ~300 resources, from most to least common (on a 0–250 scale): browsing / canned queries, keyword searches, download in other format, download in BioPAX format, download in PSI format, download in SBML format, SQL queries, download in CellML format. (BioPAX, PSI, SBML and CellML are the standard formats.)
40. Conclusions
• Data integration
– Adopting standard formats
– Building a federated system of sources
– Describing data with ontologies
– Using standard identifiers
– Mapping references from different domains
Integration of biological data of various types and development of adapted bioinformatics tools represent critical objectives to enable research at the systems level. The European Network of Excellence ENFIN is engaged in developing an adapted infrastructure to connect databases and platforms, to enable both generation of new bioinformatics tools and experimental validation of computational predictions. Beyond the use of common standards to format individual datasets, there is a need for sophisticated informatics platforms to enable mining data across various domains, sources, formats and types. The aim of the EnCORE project is to integrate, across different disciplines, an extensive list of database resources and analysis tools in a computationally accessible and extensible manner, facilitating automated data retrieval and processing with a special focus on systems biology. The EnCORE platform is available as a collection of web services with a common standard format, easy to integrate into workflow management software such as Taverna. Additionally, EnCORE services are also accessible through EnVISION, a web graphical user interface providing elaborate information such as molecular interactions, biological pathways and computational models of pathways.
EBI has a comprehensive collection of databases.
We are not alone.
The latest “Molecular Biology Database Collection” published in NAR describes more than 1400 database resources.
Of these 1400, more than 100 are pathway databases.
Pathguide lists more than 300 pathway resources.
In general we can say there are lots of pathway resources available in different databases.
As a biologist I would prefer to see all the information in one single database.
Centralized databases have this mission:
they aim to collect all the information for one specific domain.
However …
Medium-size labs and organizations are capable of producing large amounts of data.
It then becomes harder to submit data to centralized repositories.
Moreover, data producers like to control and structure their own databases, developing their own GUIs and access protocols.
For us, the users, it becomes harder to access the information.
For one specific domain we might find different databases using different GUIs. We might end up downloading data in different formats, complicating the integration of results. After integration we might face a problem of high redundancy in our results.
This integration problem is well defined by this chart.
In bioinformatics before we didn’t have to much data available to help biologist
Now we have the data but it is not very useful if it is difficult to find and difficult to access.
Data producers have good reasons to have their own database.
However, all of us have to think about ways to share our data and make it easily available to users.
Federation provides an easy way to integrate data resources.
It is fully compatible with database providers continuing to work with their own database structure, GUI, ...
…
Mapping tools allow everyone to work in the same identification space.
…
A protocol to exchange data.
A network of biological resources
A standard XML format.
A federated system.
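The identifier-mapping component mentioned above can be sketched as a simple lookup into one canonical identification space. This is a minimal illustration, not a real mapping service; the namespaces, accessions and table contents below are examples chosen for the sketch.

```python
# Minimal sketch of an identifier-mapping step, so that records from
# different databases can be merged in one identification space.
# The lookup table is illustrative only, not a real mapping service.
ID_MAP = {
    # (source namespace, identifier) -> canonical accession
    ("ensembl", "ENSP00000269305"): "P04637",
    ("refseq",  "NP_000537"):       "P04637",
    ("uniprot", "P04637"):          "P04637",
}

def to_canonical(namespace, identifier):
    """Map an identifier from any supported namespace to a canonical one."""
    return ID_MAP.get((namespace.lower(), identifier))

# Records from two databases that actually refer to the same protein
records = [("ensembl", "ENSP00000269305"), ("refseq", "NP_000537")]
canonical = {to_canonical(ns, acc) for ns, acc in records}
print(canonical)  # both records collapse to one canonical accession
```

Once everything lives in one identification space, the redundancy problem described earlier largely disappears: duplicates become literal duplicates and can be filtered with a set.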
Different distributed databases install the DAS protocol.
A client can then send the same query to all of these databases,
and all of them will return their results in the same standard XML format over the internet.
This makes it easy for the client to put all the annotations together.
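The benefit of the shared response format can be sketched as follows: because every DAS source answers in the same XML layout, one parser handles every server and the annotations pool naturally. The fragments below are simplified, illustrative DASGFF-style responses, not real server output.

```python
# Sketch of merging DAS responses: every source answers the same query
# with the same XML layout, so one parser handles all servers and the
# annotations can simply be concatenated.
import xml.etree.ElementTree as ET

# Two example responses, as if returned by two different DAS servers
# (simplified, illustrative DASGFF-style fragments).
RESPONSES = [
    """<DASGFF><GFF><SEGMENT id="P04637">
         <FEATURE id="f1"><TYPE id="domain">DNA-binding</TYPE></FEATURE>
       </SEGMENT></GFF></DASGFF>""",
    """<DASGFF><GFF><SEGMENT id="P04637">
         <FEATURE id="f2"><TYPE id="site">Phosphosite</TYPE></FEATURE>
       </SEGMENT></GFF></DASGFF>""",
]

def parse_features(xml_text):
    """Extract (feature id, type) pairs; works for any conforming server."""
    root = ET.fromstring(xml_text)
    return [(f.get("id"), f.find("TYPE").text) for f in root.iter("FEATURE")]

# One parser, many servers: just concatenate the parsed annotations.
annotations = [feat for resp in RESPONSES for feat in parse_features(resp)]
print(annotations)  # [('f1', 'DNA-binding'), ('f2', 'Phosphosite')]
```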
EnVISION2 (an ENFIN tool) queries for molecular interactions using PSICQUIC.
It connects to the PSICQUIC registry to find out which servers are available,
and queries them for molecular interactions for a list of protein accessions (in this case two proteins interacting with each other).
It merges the results, filtering out redundancy, and displays them in a table and in an interaction network.
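The redundancy-filtering step can be sketched like this. PSICQUIC services return interactions as tab-separated MITAB rows whose first two columns identify the interactors; the same pair reported by two services (possibly in swapped order) should count once. The MITAB lines below are simplified examples, not real service output.

```python
# Sketch of merging PSICQUIC results: build an order-independent key
# from the first two MITAB columns and deduplicate with a set.
# The MITAB lines are simplified, illustrative examples.
SERVICE_RESULTS = [
    # as if returned by service 1
    ["uniprotkb:P04637\tuniprotkb:Q00987\tpsi-mi:..."],
    # as if returned by service 2, reporting the same pair swapped
    ["uniprotkb:Q00987\tuniprotkb:P04637\tpsi-mi:..."],
]

def interaction_key(mitab_line):
    """Order-independent key for an interacting pair (MITAB columns 1-2)."""
    a, b = mitab_line.split("\t")[:2]
    return tuple(sorted((a, b)))

unique = {interaction_key(line)
          for result in SERVICE_RESULTS for line in result}
print(len(unique))  # 1: the swapped duplicate is filtered out
```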
Let me talk about what we do in ENFIN regarding data integration.
ENFIN is a project that brings together experimentalists and computational biologists to help each other and to develop bioinformatics resources for systems biology.
The idea behind EnCORE is simplified in this picture.
The input (our query) is contained in a standard XML format called EnXML.
We can run different services over this input,
and we get results contained in the same EnXML format.
The outputs can then be used as inputs to other services.
We are exposed to a very diverse world of services.
EnCORE provides an easy way to build workflows, since inputs and outputs share the same standard format.
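Why a shared input/output format makes workflows trivial can be sketched as plain function composition: every service consumes and produces the same kind of document, so chaining is just feeding one output into the next call. The document structure and service behaviour below are stand-ins for illustration, not the real EnXML schema or EnCORE services.

```python
# Sketch: when every service maps an EnXML-like document to another
# such document, a workflow is just function composition.
# The dict structure and service logic are illustrative stand-ins.
def map_identifiers(doc):
    """Toy 'mapping' service: normalize protein accessions."""
    doc = dict(doc)
    doc["proteins"] = [p.upper() for p in doc["proteins"]]
    return doc

def fetch_interactions(doc):
    """Toy 'interaction' service: attach a pretend partner per protein."""
    doc = dict(doc)
    doc["interactions"] = [(p, p + "_partner") for p in doc["proteins"]]
    return doc

def run_workflow(doc, services):
    for service in services:
        doc = service(doc)  # output of one service feeds the next
    return doc

result = run_workflow({"proteins": ["p04637"]},
                      [map_identifiers, fetch_interactions])
print(result["interactions"])  # [('P04637', 'P04637_partner')]
```

Because each step both accepts and returns the same document shape, reordering or extending the pipeline needs no glue code, which is exactly the property the shared EnXML format buys.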
This is a generic example of how an EnCORE service works.
A specific example:
The query is a protein accession.
We run the IntAct service,
and we get the interaction results described in EnXML terminology.
The same results are shown in a table.
EnCORE facilitates building workflows.
EnVISION is an interface to EnCORE.
With just one click, users can run different services and get a quick overview of a dataset.
This example shows results for …
Here is an example of the potential of EnVISION.
In this example we used a dataset of more than 300 protein accessions.
In this screenshot, EnVISION was able to find more than 500 pathways for this dataset.
EnVISION is capable of linking and displaying positive results in a pathway map.
TOP: Reactions present in our dataset.
MIDDLE: Heatmap.
BOTTOM: Proteins from our dataset found in Reactome reactions.
Heatmap displaying the pathways represented in our dataset.
Color identifies the best hits:
red means more proteins from our dataset are present in the reaction.
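The colouring rule can be sketched as a simple coverage score: for each reaction, the fraction of its proteins that come from our dataset, scaled to a red intensity. The reaction memberships and accessions below are illustrative, not real Reactome content.

```python
# Sketch of the heatmap colouring: coverage of a reaction by our
# dataset, scaled to a red channel value (0-255).
# Reaction memberships below are illustrative, not real Reactome data.
DATASET = {"P04637", "Q00987", "P38398"}
REACTIONS = {
    "reaction_A": {"P04637", "Q00987"},           # fully covered
    "reaction_B": {"P38398", "P12345", "P67890"}, # partially covered
}

def red_intensity(reaction_proteins, dataset):
    """Fraction of the reaction's proteins found in the dataset, as 0-255."""
    coverage = len(reaction_proteins & dataset) / len(reaction_proteins)
    return int(round(coverage * 255))

for name, proteins in sorted(REACTIONS.items()):
    print(name, red_intensity(proteins, DATASET))
```

A fully covered reaction gets the deepest red, and a reaction with no dataset proteins stays at zero, matching the "red means better hit" reading of the heatmap.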
EnVISION results are nice, but do not forget our initial integration problem.
For one domain (protein interactions, pathways, protein sequences …) we might have several databases providing data.
EnCORE provides a great solution; however, it is not complete if it cannot include more resources.
It is not feasible for EnCORE to develop and maintain so many wrappers.
Nonetheless, EnCORE can overcome this problem using standards and federated systems.
Right now, EnCORE workflows are predefined.
There are two types of workflow.
They are static and not very easy to adapt to user needs.
Semantic web technologies seem to be the solution for building intelligent workflows.
However, the semantic web is at an early stage,
and molecular biology may be too complicated for it.
I personally think “user-selection-based workflows” are a better solution for developers and users, as long as we keep them as simple as they are in EnCORE.
To conclude …
I would like to see more database providers and users using standards.
Just 20% of the databases described in Pathguide use standards.
Data integration and proper representation of pathway information will be possible if …