Presentation by Eugeni Belda (LABGeM-Genoscope) at the Biocuration 2012 conference (Georgetown University, Washington, DC): From bacterial genome annotation to metabolic pathway curation
The document describes a study evaluating the Agilent Q-TOF 6520 LC/MS platform for proteomic analysis of brush border membranes from rat kidney proximal tubules. The study used two parallel workflows: 1) C18-NSI-MS2 on an LTQ mass spectrometer for preliminary analysis, and 2) C18-NSI-MS2 on an Agilent Q-TOF 6520 mass spectrometer. Peptide and spectral counts were higher for the Q-TOF data. Feature finding using retention times, mass accuracy, and MS2 identifications correlated identified peptides across sample arrays, validating the increased identification capabilities of the Q-TOF platform.
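The cross-run feature matching described here can be sketched in Python; the m/z values, retention times, and tolerance settings below are illustrative assumptions, not values from the study:

```python
# Hypothetical feature lists from two LC/MS runs: (m/z, retention time in minutes).
run_a = [(445.12, 23.4), (512.30, 31.0), (689.81, 47.2)]
run_b = [(445.13, 23.6), (512.31, 30.8), (900.00, 55.0)]

def match_features(a, b, ppm_tol=25.0, rt_tol=0.5):
    """Pair features across runs whose masses agree within ppm_tol
    parts-per-million and whose retention times agree within rt_tol minutes."""
    matches = []
    for mz1, rt1 in a:
        for mz2, rt2 in b:
            if abs(mz1 - mz2) / mz1 * 1e6 <= ppm_tol and abs(rt1 - rt2) <= rt_tol:
                matches.append(((mz1, rt1), (mz2, rt2)))
    return matches

matches = match_features(run_a, run_b)  # two of the three features pair up
```

In practice the MS2 identifications provide a third, independent check that matched features really are the same peptide.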
Multi-scale network biology model & the model library - laserxiong
This document discusses multi-scale network biology models and a network model library. It describes how the library would contain different types of nodes and edges to represent diverse biological interactions. The library would annotate pre-defined network models and integrate updated models. It also discusses multi-scale networks from the inter-cellular to inter-tissue levels. A case study on prioritizing pre-clinical drugs via prognosis-guided genetic interaction networks is mentioned. The document notes challenges in current disease models for drug development and proposes approaches like synergistic outcome determination and module-module cooperation networks to address them.
1) GenWiki is a wiki system that seamlessly integrates natural language processing (NLP) capabilities to help biocurators manually refine and update bioinformatics databases by extracting semantic entities and facts from literature.
2) The system uses the mycoMINE text mining pipeline based on GATE to perform NLP tasks like named entity recognition, fact extraction, and ontology population.
3) An evaluation showed GenWiki reduced the average curation time for selecting an abstract by 67% and for reviewing a full paper by 20% compared to having no semantic support.
"Biomolecular annotation prediction through information integration" - Davide Chicco
Talk by Davide Chicco, delivered at the 8th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2011), Gargnano sul Garda, Lombardy, July 2011
Integration of Bioinformatics Web Services through the Search Computing Techn... - Davide Chicco
Here are the key steps in Latent Semantic Indexing using SVD to measure semantic similarity between genes:
1. Build an annotation matrix with genes as rows and annotation terms as columns, with 1's indicating which genes are annotated to which terms.
2. Perform SVD on the annotation matrix to decompose it into three matrices: Uk, Σk, VTk.
3. Uk contains the vectors representing each gene in the reduced k-dimensional semantic space.
4. The similarity between two genes can be measured as the cosine similarity between their corresponding vectors in Uk. Genes with more similar vectors are considered more semantically similar based on their shared annotations.
So, in summary, LSI uses SVD to project genes into a reduced k-dimensional semantic space, where genes with similar annotation profiles lie close together and can be compared by cosine similarity.
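The four steps above can be sketched with NumPy; the binary annotation matrix is a made-up toy example:

```python
import numpy as np

# Hypothetical annotation matrix: rows = genes, columns = annotation terms,
# 1 where a gene is annotated to a term.
A = np.array([
    [1, 1, 0, 0],  # gene g1
    [1, 1, 1, 0],  # gene g2: shares two terms with g1
    [0, 0, 1, 1],  # gene g3: shares nothing with g1
])

# Truncated SVD: keep the k strongest latent "semantic" dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]  # each row is a gene vector in the reduced k-dimensional space

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Genes sharing annotations (g1, g2) should score higher than unrelated ones (g1, g3).
sim_12 = cosine(Uk[0], Uk[1])
sim_13 = cosine(Uk[0], Uk[2])
```

With this toy matrix, sim_12 is clearly larger than sim_13, reflecting the shared annotations of g1 and g2.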
Integrating Public and Private Data: Lessons Learned from Unison - Reece Hart
The document discusses lessons learned from integrating public and private data using the Unison platform. It describes the types of data that can be integrated, including genomics, proteomics, chemistry, networks, and clinical data. It outlines different types of integration like semantic and source integration. Challenges of integration include establishing relationships between data and handling frequent updates. Benefits include enabling analysis across diverse data types and centralizing data. Unison integrates sequences, annotations, auxiliary data and precomputed predictions from sources like UniProt and Ensembl to power applications, in-house tools and data mining projects.
Unison: Enabling easy, rapid, and comprehensive proteomic mining - Reece Hart
Unison is an online database and data integration platform that aggregates proteomic and genomic data from multiple sources and provides over 200 million precomputed predictions on protein sequences, domains, structures, and more. It aims to enable easy, rapid, and comprehensive proteomic mining through semantic integration of distinct data types and automated querying of predictions. Custom data mining projects using Unison have led to discoveries about proteins like Bcl-2 that regulate apoptosis.
This document discusses the process of analyzing sequencing data from the NA12878 reference sample. It describes the 3 phases required to turn raw sequencing reads into usable variant calls: 1) NGS data processing, 2) variant discovery and genotyping, and 3) integrative analysis. Phase 1 involves tasks like mapping, local realignment, and duplicate marking to produce analysis-ready reads. Phase 2 identifies SNPs, indels and structural variants. Phase 3 performs quality control and combines results with other data. The document emphasizes the extensive processing needed to produce reliable variant calls from raw sequencing data.
The GeneArt® Gene Synthesis service consists of chemical synthesis, cloning, and sequence verification of virtually any desired genetic sequence. You will receive a bacterial stab and/or purified plasmid containing your synthesized gene—ready for downstream applications.
Whether you have limited cloning experience or simply want to save time, the GeneArt® Gene Synthesis service helps you move your ideas from the planning stage to the laboratory more quickly. Benefit from our experience in successfully producing over 180,000 constructs for customers as diverse as large pharmaceutical companies, biotechnology start-ups, and basic research institutions. The comparison shown in the figure below highlights the time and effort saved compared to traditional cloning. For more information visit:
https://www.invitrogen.com/site/us/en/home/Products-and-Services/Applications/Cloning/gene-synthesis.html?CID=genesynthesis-SS-12312
Consortium to produce biofuels from Jatropha - ehiosa
This document summarizes a consortium project between institutions in Japan, Indonesia, and Botswana to develop Jatropha plants that can produce clean biofuel through molecular breeding. The goals are to increase Jatropha productivity and develop plants that absorb more carbon dioxide. Participating organizations will work on molecular breeding techniques, field testing in different environments, and evaluating fuel production from higher yielding Jatropha varieties. The end goal is to assist energy needs in Asia and Africa through a sustainable Jatropha biofuel production system.
Metin Bilgin is a molecular and cellular biologist with over 12 years of postgraduate research experience. He has expertise in proteomics, protein expression, and characterization. Some of his accomplishments include co-developing the first proteome chip and establishing HTP assay protocols for protein array technology. He has studied various topics like cell cycle regulation, cytochrome P450 metabolism, and nuclear hormone receptor regulated drug metabolism. Currently, he is a postdoctoral research associate studying regulation of cytochrome P450 activity by nuclear hormone receptor CAR. He aims to work for a leading life sciences research company focused on discovery and translational medicine.
This is the second presentation of the BITS training on 'Mass spec data processing'.
It reviews the methods for separating protein mixtures prior to further analysis.
Thanks to the Compomics Lab of the VIB for their contribution.
1) AbstractDB & ProteinComplexDB are databases that contain protein complexes extracted from PubMed abstracts along with the abstracts themselves.
2) The databases were developed using a Bayesian classifier to rank abstracts by their relevance to protein complexes based on the frequency of discriminatory words.
3) The databases allow users to validate extracted protein complexes by searching against known complex databases and enable scientists to evaluate and revise the data.
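The ranking idea in point 2 can be sketched as a naive Bayes log-odds score over word frequencies; the labelled abstracts below are invented toy data, not content from AbstractDB:

```python
import math
from collections import Counter

# Hypothetical training abstracts, labelled relevant (about protein complexes) or not.
relevant = ["complex subunit binds protein", "protein complex assembly interaction"]
irrelevant = ["gene expression microarray study", "sequence alignment genome"]

def word_counts(docs):
    c = Counter()
    for d in docs:
        c.update(d.split())
    return c

pos, neg = word_counts(relevant), word_counts(irrelevant)
vocab = set(pos) | set(neg)

def log_odds(abstract):
    """Naive Bayes log-odds that an abstract is relevant, with add-one smoothing
    so unseen words do not zero out the score."""
    score = 0.0
    for w in abstract.split():
        p = (pos[w] + 1) / (sum(pos.values()) + len(vocab))
        q = (neg[w] + 1) / (sum(neg.values()) + len(vocab))
        score += math.log(p / q)
    return score

# Abstracts using complex-related discriminatory words rank above unrelated ones.
ranked = sorted(["protein complex interaction", "genome alignment"],
                key=log_odds, reverse=True)
```

The real system would train on many curated abstracts; the principle of ranking by discriminatory word frequency is the same.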
Network cheminformatics: gap filling and identifying new reactions in metabol... - Neil Swainston
The number of published metabolic network reconstructions is increasing, as is the range of their applications. However, such reconstructions commonly include gaps (see Figure 1), caused by incomplete source databases or by holes in the biochemical knowledge reported in the literature. Gap filling has been aided by automated techniques that add candidate reactions from external resources such as KEGG.
The approach introduced here is to apply cheminformatics to determine and quantify chemical similarity across all metabolites in a metabolic network of S. cerevisiae. The hypothesis is that metabolite pairs of high chemical similarity are likely to form reaction pairs, in which one metabolite can be converted to the other by a single chemical reaction. High-scoring pairs that do not currently form a reaction pair in the network can then be analysed, either by comparison with existing data resources or through literature searches, to determine whether they take part in a metabolic reaction.
Following this approach, preliminary results have led to the discovery of missing information from KEGG, and the assignment of function and determination of kinetic constants to a gene of previously unknown function.
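A minimal sketch of the similarity scoring, assuming metabolite fingerprints are already available as sets of "on" bits (in a real pipeline these would come from a cheminformatics toolkit such as RDKit, and the fingerprints below are invented):

```python
# Hypothetical metabolite fingerprints as sets of "on" bits.
fingerprints = {
    "glucose":     {1, 4, 7, 9, 12},
    "glucose-6-P": {1, 4, 7, 9, 12, 15},  # one phosphorylation away from glucose
    "tryptophan":  {2, 3, 8, 20, 31},
}

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    return len(a & b) / len(a | b)

def candidate_pairs(fps, threshold=0.7):
    """All metabolite pairs scoring above the threshold: candidate reaction
    pairs to check against the network and the literature."""
    names = sorted(fps)
    return [(x, y, tanimoto(fps[x], fps[y]))
            for i, x in enumerate(names) for y in names[i + 1:]
            if tanimoto(fps[x], fps[y]) >= threshold]

pairs = candidate_pairs(fingerprints)
```

Here only the glucose / glucose-6-phosphate pair clears the threshold, mirroring the intuition that a single reaction separates chemically similar metabolites.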
Structure generation, metabolite space, and metabolite likeness - VodafoneZiggo
The document discusses metabolite identification from mass spectrometry data. It describes how a structure generator works to generate candidate metabolite structures from an elemental composition. The generator adds bonds in all possible ways to create structures, then uses isomorphism and canonical labeling to remove duplicate structures within the same isomorphic class. This process generates a list of candidate metabolite structures for further analysis and filtering against experimental data.
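The duplicate-removal step can be sketched by brute-force canonical labelling of small bond graphs; real generators use efficient canonical labelling algorithms and track atom elements, while the structures below are toy graphs invented for illustration:

```python
from itertools import permutations

def canonical_form(n_atoms, bonds):
    """Smallest relabelling of the bond set over all atom permutations.
    Exponential in n_atoms, so only viable for tiny toy graphs; two
    isomorphic structures yield the same canonical form."""
    best = None
    for perm in permutations(range(n_atoms)):
        key = tuple(sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds))
        if best is None or key < best:
            best = key
    return best

# Three generated candidate structures as undirected bond lists over 3 atoms:
s1 = [(0, 1), (1, 2)]            # a 3-atom chain
s2 = [(0, 2), (2, 1)]            # the same chain, atoms numbered differently
s3 = [(0, 1), (1, 2), (2, 0)]    # a 3-ring: genuinely different

# Keeping one structure per canonical form collapses s1 and s2 into one entry.
unique = {canonical_form(3, s) for s in (s1, s2, s3)}
```

This is the essence of removing duplicates "within the same isomorphic class": structures that differ only in atom numbering map to one canonical representative.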
Course: Bioinformatics for Biomedical Research (2014).
Session: 1.3- Genome Browsing, Genomic Data Mining and Genome Data Visualization with Ensembl, Biomart and IGV.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) of the Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Pathways and genomes databases in bioinformatics - sarwat bashir
The document discusses the PAGED database, which integrates various bioinformatics databases to enable molecular phenotype discovery. PAGED contains over 25,000 gene sets from sources like pathways, disease-gene associations, gene signatures, microRNA targets, and protein-protein interaction networks. It allows users to explore relationships between gene sets and identify pathways, signatures, and modules associated with specific human diseases. The database was designed to integrate data from several sources and allow comprehensive searches and analysis to further biological research.
The document discusses various bioinformatics databases that store different types of biological information such as DNA sequences, protein sequences, protein structures, gene expression data, and biomedical literature. It describes several major public primary databases like GenBank and PDB as well as derived databases like Swiss-Prot, UniGene, and RefSeq that compile or integrate data from primary sources. The databases are interlinked and can be accessed through search tools on sites like NCBI Entrez.
CitrusCyc: Metabolic Pathway Databases for the C. clementina and C. sinensis... - Surya Saha
CitrusCyc is a metabolic pathway database for the Citrus clementina and Citrus sinensis genomes. It was constructed using the Pathway Tools software and contains pathways, reactions, enzymes and genes derived from the annotated citrus genomes and the MetaCyc database. The database contains over 25,000 proteins and 40,000 transcripts with EC numbers for both citrus species. It provides visualizations of metabolic pathways and allows for overlay of RNA-seq expression data. Future work includes manual curation of pathways and development of a Meta-CitrusCyc database.
This document provides an outline for classroom content with a page title and two main items, each with sub-items. Item 1 has three sub-items labeled Sub 1a, Sub 1b, and Sub 1c. Item 2 has four sub-items labeled Sub 2a, Sub 2b, Sub 2c, and Sub 2d.
The document discusses various topics related to drug discovery through bioinformatics and computational approaches. It begins by discussing comparative genomics and using knowledge about model organisms to identify similar biological areas and pathways in other species. It also discusses topics like high-throughput screening of large libraries, the definitions of targets, hits and leads in drug discovery, and approaches like using RNAi and phenotypic screening in model organisms. Finally, it discusses computational methods that can be used throughout the drug discovery process, including for target identification and validation, virtual screening, assessing drug-likeness of compounds, and describing compounds using structural and physicochemical descriptors.
Genomic Big Data Management, Integration and Mining - Emanuel Weitschek (Data Driven Innovation)
This document summarizes genomic big data management, integration and mining. It discusses the exponential growth of biological data due to advances in sequencing technologies. Next generation sequencing techniques generate large amounts of short DNA reads. Several public databases contain heterogeneous biological data sources. Effective data management and integration methods are needed to analyze these large and complex datasets. Supervised machine learning can be used to extract knowledge and classify samples. Tools like CAMUR apply rule-based classification to problems like analyzing gene expression from cancer datasets. Future work involves advanced integration systems and new big data approaches for biological data.
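The rule-based classification that tools like CAMUR perform can be sketched as follows; the genes, thresholds, and sample labels are invented for illustration, not CAMUR's actual rules or API:

```python
# Hypothetical expression profiles: gene -> normalized expression value, with a label.
samples = [
    ({"BRCA1": 0.2, "TP53": 0.9}, "tumor"),
    ({"BRCA1": 0.8, "TP53": 0.3}, "normal"),
    ({"BRCA1": 0.1, "TP53": 0.7}, "tumor"),
]

# Human-readable if-then rules of the kind rule-based classifiers emit.
rules = [
    (lambda s: s["BRCA1"] < 0.5 and s["TP53"] > 0.5, "tumor"),
    (lambda s: s["BRCA1"] >= 0.5, "normal"),
]

def classify(sample, rules, default="unknown"):
    """Return the label of the first rule whose condition the sample satisfies."""
    for condition, label in rules:
        if condition(sample):
            return label
    return default

accuracy = sum(classify(s, rules) == y for s, y in samples) / len(samples)
```

The appeal of this approach for cancer datasets is that each rule is directly interpretable as a statement about a small set of genes.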
The document discusses metabolic pathway engineering and metabolic engineering. It provides an overview of four commercially important fermentation products, including the microorganism used, annual production levels, and applications. It then discusses the core concepts of metabolic engineering, including manipulating enzymatic and regulatory functions using recombinant DNA to improve cellular activities. Examples of applications include strain improvement for biocatalysis and bioprocessing, increasing productivity, and developing novel biosynthetic routes.
Using ontologies to do integrative systems biology - Chris Evelo
The document discusses using ontologies to integrate systems biology data. It describes typical steps in systems biology studies such as finding studies, processing data, integrating data, and combining data from multiple sources. Ontologies can help link information from different analysis techniques and combine data from many studies by capturing study metadata. The document advocates using standards like ISA-TAB and MAGE-TAB to capture study data and proposes using a generic study capture framework with modular components to integrate different types of 'omics data. Ontologies are needed for collaboration and to provide controlled vocabularies for annotation.
Stephen Friend, Fanconi Anemia Research Fund, 2012-01-21 - Sage Base
This document summarizes Stephen Friend's presentation on using data intensive science and bionetworks to build better maps of human diseases. It discusses how collecting and integrating massive amounts of molecular and clinical data using open information systems and computing could enable the development of more comprehensive and probabilistic causal models of diseases. These evolving disease maps may help identify causal genes and pathways involved in various conditions. The presentation outlines Sage Bionetworks' mission to create a commons for scientists to collaborate on building and refining such integrative bionetworks to accelerate the elimination of human disease.
This lab aims to analyze gene expression data from a study on the response of human fibroblasts to serum. The study used cDNA microarrays to explore the temporal program of gene expression during this physiological response, identifying genes clustered by their expression patterns. Many features of the transcriptional program appeared related to wound repair processes, suggesting fibroblasts play a richer role than previously thought. The lab will introduce gene expression analysis, demonstrate basic Excel tools for working with microarray data, and use the GEPAS suite to apply the full microarray analysis process to the fibroblast dataset, including preprocessing, clustering, and identifying differentially expressed genes.
The document discusses Emerald Bio's approach to parallel protein purification at the milligram scale using automated multi-target parallel processing (MTPP). Key points include:
- MTPP has delivered over 100 protein structures from over 13 targets, with over 60 containing bound ligands.
- Producing hundreds of protein structures requires thousands of purified proteins, with Emerald Bio purifying over 220 different proteins totaling over 9 grams.
- Emerald Bio's Protein Maker enables high-throughput parallel protein purification of up to 24 samples in a single run from cell lysates or fractions as small as 1 milliliter.
This document proposes using data intensive science to build better models of disease. It notes that current disease models make simplistic assumptions and that personalized medicine requires better representations of overlapping pathways. It advocates adopting the "fourth paradigm" of data intensive science to generate massive datasets, ensure interoperability, create open information systems, and host evolving computational models. Six pilot projects are described that involve collaborative data sharing between industry, academia, and non-profits to build disease maps and models. These include initiatives like CTCAP to share clinical trial data, Arch2POCM to de-risk drug targets, and forming a federation to enhance interoperability. The document argues this approach could help address issues like a lack of standard
1) The document discusses performance analysis of DNA analysis using the Genome Analysis Toolkit (GATK).
2) GATK is a software tool used to analyze sequencing data that enables optimized use of CPU and memory for high-throughput and distributed/parallel processing of DNA data.
3) The document provides details on GATK architecture, how it distributes data into shards for scalable analysis, and how it allows merging of multiple data sources and parallelization of jobs.
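The sharding idea behind this kind of scalable analysis can be sketched in a few lines. This is an illustrative helper, not GATK's actual API: a chromosome is cut into fixed-size intervals that independent workers can then process in parallel.

```python
# Illustrative sketch of shard-based distribution (hypothetical helper,
# not GATK's actual API): cut a chromosome into fixed-size, half-open
# intervals that independent workers can process in parallel.
def make_shards(chrom, length, shard_size):
    """Return (chrom, start, end) intervals covering [0, length)."""
    return [(chrom, start, min(start + shard_size, length))
            for start in range(0, length, shard_size)]

# chr20 of GRCh37 is ~63 Mb; 10 Mb shards give 7 independent work units.
shards = make_shards("chr20", 63_025_520, 10_000_000)
```

Each shard can then be submitted as a separate job, and per-shard results merged afterwards.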
This document discusses the importance of open source software and open data in biomedical research. It notes that biological data is growing exponentially and highlights several open source bioinformatics tools like EMBOSS and web services provided by the EBI that enable researchers to access and analyze data. The document advocates for open standards to facilitate data integration and management across different omics domains.
The document discusses how integrative studies can provide insights through combining candidate genomic regions, mitochondrial proteomic data, and cancer expression compendiums to discover genes involved in diseases like Leigh Syndrome and cancers. It also highlights several other studies that have integrated data like DNA sequences, copy numbers, methylation, expression profiles, and pathways to characterize disease subtypes and improve risk stratification for conditions such as glioblastoma multiforme and medulloblastoma. The document presents an example of a translational research study that integrated multiple genomic data types and computational tools in 12 steps to analyze alterations in gene expression and identify potential transcription factor binding sites.
The document discusses the evolution of genomic resources at the National Center for Biotechnology Information (NCBI) over the past 22 years. It shows graphs of the growth in data volumes for resources like GenBank, users accessing services, and the number of human variations cataloged in dbSNP. Key resources highlighted include PubMed, BLAST, Entrez, GenBank, dbSNP, Reference Sequence (RefSeq), Genome Remapping Service, Sequence Read Archive, and more. The document outlines NCBI's role in organizing and providing access to genomic and biomedical literature data.
Summary: ENViz performs enrichment analysis for pathways and gene ontology (GO) terms in matched datasets of multiple data types (e.g. gene expression and metabolites or miRNA), then visualizes results as a Cytoscape network that can be navigated to show data overlaid on pathways and GO DAGs.
Background: Modern genomic, metabolomic, and proteomic assays produce multiplexed measurements that characterize molecular composition and biological activity from complementary angles. Integrative analysis of such measurements remains a challenge to life science and biomedical researchers. We present an enrichment network approach to jointly analyzing two types of sample matched datasets and systematic annotations, implemented as a plugin to the Cytoscape [1] network biology software platform.
Approach: ENViz analyses a primary dataset (e.g. gene expression) with respect to a ‘pivot’ dataset (e.g. miRNA expression, metabolomics or proteomics measurements) and primary data annotation (e.g. pathway or GO). For each pivot entity, we rank elements of the primary data based on their correlation to the pivot across all samples, and compute statistical enrichment of annotation sets in the top of this ranked list based on minimum hypergeometric statistics [2]. Significant results are represented as an enrichment network - a bipartite graph with nodes corresponding to pivot and annotation entities, and edges corresponding to pivot-annotation pairs with statistical enrichment scores above the user-defined threshold. Correlations of primary data and pivot data are visually overlaid on biological pathways for significant pivot-annotation pairs using the WikiPathways resource [3], and on gene ontology terms. Edges of the enrichment network may point to functionally relevant mechanisms. In [4], a significant association between miR-19a and the cell-cycle module was substantiated as an association to proliferation, validated using a high-throughput transfection assay. The figures below show a pathway enrichment network, with pathway nodes green and miRNAs gray (left), network view of the edge between Inflammatory Response Pathway and mir-337-5p (center), and GO enrichment network with red areas indicating high enrichment for immune response and metabolic processes (right).
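A simplified sketch of the enrichment step: here a fixed top-k cutoff is used, whereas ENViz itself uses the minimum-hypergeometric statistic, which scans all cutoffs of the ranked list. All function and variable names are illustrative.

```python
from math import comb

# Simplified sketch of ranked-list enrichment (fixed top-k cutoff; ENViz
# uses the minimum-hypergeometric statistic, which scans all cutoffs).
def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n items from a population of N with K annotated."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

def enrichment_p(ranked, annotated, top_n):
    """Enrichment p-value of an annotation set in the top of a ranked list."""
    k = len(set(ranked[:top_n]) & annotated)   # annotated genes in the top
    K = len(annotated & set(ranked))           # annotated genes overall
    return hypergeom_tail(k, top_n, K, len(ranked))
```

For example, if all three members of an annotation set land in the top 3 of a 10-gene ranked list, the p-value is 1/C(10,3) = 1/120.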
This document discusses the marriage of translational medicine and big data. It notes that predicting treatment response to known oncogenes like EGFR is complex and requires detailed understanding of genetic backgrounds. Networks can identify genes causal for disease. The approach uses probabilistic causal network models, with over 80 publications validating the scientific approach. Sage Bionetworks is building disease maps and data repositories through collaborations with industry, foundations, government and academia. Fundamentally, biological science hasn't changed due to omics but iterative networked approaches are needed to generate, analyze and support new disease models.
This document summarizes research characterizing DNA methylation in the Pacific oyster Crassostrea gigas. High-throughput bisulfite sequencing was used to analyze DNA methylation patterns at high resolution. Several genes were found to have different levels and patterns of methylation across tissues and developmental stages. The results provide evidence that DNA methylation plays an important regulatory role and may be involved in environmental responses in C. gigas. Future work will investigate how epigenetic mechanisms are affected by environmental stressors.
The document discusses metagenomics analysis tools and challenges. It summarizes several metagenome analysis portals that provide computational analysis and public sample databases. It also discusses the rapid growth of metagenomic data being produced, challenges around quality control, feature identification, characterization and presentation of metagenomic data, and the need for standardized metadata and data formats. The future directions highlighted include studying strain variation, expanding metadata capture and standards, and developing improved assembly, binning and analysis methods.
This document outlines a talk on protein function and bioinformatics. It discusses why bioinformatics is needed due to the rapid increase in genomic data. It introduces various bioinformatics tools for tasks like sequence analysis, database searches, and structure prediction. As a case study, it examines the genome of the psychrophilic archaeon Methanococcoides burtonii, identifying cold-adaptation features like CSP-like proteins and modified tRNAs. It emphasizes that bioinformatics provides useful predictions but must be integrated with experimental data.
This document describes a new approach to predict protein function in humans by combining large-scale evolutionary analyses with multiple biological data sources. The approach uses 49,231 features derived from sources like sequence similarity, predicted structural characteristics, domain architectures, gene fusions, gene co-expression, and protein-protein interactions to compute a functional similarity score between proteins. This functional similarity score is then used to predict Gene Ontology terms and annotate unannotated human protein sequences. The approach was able to annotate 30% of previously unannotated human protein sequences.
Combining large-scale evolutionary analyses with multiple biological data sources to predict human protein function. The approach uses sequence and structural features, gene expression data, protein interactions, and domain architectures to compute a functional similarity score between proteins. This allows predicting functions for unannotated human proteins, including rare functions. The method was applied to predict Gene Ontology terms for over 20,000 unannotated human proteins, with 16% and 9% having exact matches for molecular function and biological process terms.
A significant amount of time could be saved by cutting out the unnecessary steps from traditional cloning and moving to gene synthesis. Gene synthesis has become a cost-effective, time- and resource-saving method for obtaining nearly any desired DNA construct with 100% accuracy. It outperforms conventional molecular biology techniques in terms of time and cost, while providing equivalent or better expression performance, construct stability, and quality. GeneArt® gene synthesis tools go beyond traditional synthesis and enable expression optimization and maximum performance. Watch this webinar with audio at: http://owl.li/jppYn
Stephen Friend Nature Genetics Colloquium 2012-03-24 (Sage Base)
This document proposes using data intensive science to build models of disease within a shared computing environment or "commons". It notes that current disease models often oversimplify complex conditions. Five pilot projects are described that could leverage shared clinical and genomic data as well as model building to better represent diseases: 1) sharing comparator arm data from clinical trials, 2) a federated aging analysis project, 3) portable legal consent, 4) a Sage Congress modeling competition, and 5) the BRIDGE initiative for democratizing medical research. The document argues this approach could accelerate disease understanding and new therapy development.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency (ScyllaDB)
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
Northern Engraving | Modern Metal Trim, Nameplates and Appliance Panels (Northern Engraving)
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an... (Jason Yip)
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
The Microsoft 365 Migration Tutorial For Beginner.pptx (operationspcvita)
This presentation will help you understand the power of Microsoft 365. However, we have mentioned every productivity app included in Office 365. Additionally, we have suggested the migration situation related to Office 365 and how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
What is an RPA CoE? Session 1 – CoE Vision (DianaGray10)
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application... (Alex Pruden)
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..." (Fwdays)
Direct losses from downtime in 1 minute = $5-$10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
5th LF Energy Power Grid Model Meet-up Slides (DanBrown980551)
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Biocuration2012 Eugeni Belda
1. Eugeni Belda
Laboratory of Bioinformatic Analysis in Genomics and Metabolism (LABGeM team)
CEA/DSV/IG/Genoscope & CNRS UMR8030
2. Introduction
Advances in sequencing technologies have allowed an exponential accumulation of complete genome sequences in public databases in recent years. However, a wide gap exists between the rapid advances in genome sequencing and the slow progress in the characterization of new protein functions: of 12,273 protein families (Pfam), about 26% are of unknown function, and of 4,712 enzymatic activities (EC numbers), about 25% are orphan reactions.
Genoscope (French National Sequencing Center) has as one fundamental research objective the extension of in silico sequence annotations with the experimental characterization of new enzymatic functions (Metabolic Genomics). Its laboratories include:
- Lab. of Genomics & Biochemistry of Metabolism (LGBM)
- Lab. of Organic Chemistry and Biocatalysis (LCOB)
- Lab. for Enzymatic Cloning and Screening (LCAB)
- Lab. of Bioinformatic Analysis in Genomics and Metabolism (LABGeM)
3. Three MicroScope components
- Process Management: primary databank updates, syntactic and functional/relational annotations, and more than 25 analysis methods integrated in a JBPM workflow with a job management system, giving full automatisation of genome annotation and keeping primary data up-to-date.
- Data Management: the PkGDB relational database (with release history) and MicroCyc, integrating primary databanks, internal genomic objects, computational results, and Pathway/Genome DataBases.
- MaGe Web Interface (login, tutorial): genome overview, genome browser, synteny maps (KEGG, MicroCyc, CGView, LinePlot, synton display), gene cards and gene editor, keyword search, Blast and pattern search, phylogenetic profiles, fusion/fission, tandem duplications, minimal gene set, RGPfinder, SNPs/InDels, metabolic profile, pathway/synteny views, and data export (Artemis).
References: Vallenet D. et al., "MicroScope - a platform for microbial genome annotation and comparative genomics", Database 2009; Vallenet D. et al., "MaGe - a microbial genome annotation system supported by synteny results", Nucleic Acids Research 2006.
4. Database Management
The relational database PkGDB (Prokaryotic Genome DataBase) stores EC/reaction correspondences. Experimentally elucidated metabolic pathways are taken from a reference resource of 1800 pathways from 2216 organisms (P. Karp, SRI, USA). Using Pathway Tools, a metabolic database is built for each annotated microbial genome: PGDB = Pathway/Genome Database (orgname_Cyc).
http://www.genoscope.cns.fr/agc/microcyc
Today: 1233 organisms (of which 676 public genomes). KEGG metabolic maps (http://www.kegg.jp/) are mapped on the PkGDB.
5. MicroScope Web site
More than 30 tools are made available to the community («guest» access). Since 2005, more than 50,000 expert annotations per year; > 1,000 users, 300 of them active.
www.genoscope.cns.fr/agc/microscope
6. Curation of metabolic data in MicroScope
CanOE (Candidate genes for Orphan Enzymes): a method for the automatic integration of genomic and metabolic contexts that assists expert functional annotation, especially in the case of orphan enzymes. It is based on the concept of the metabolon: “close” genes in the genome sequence associated with “close” metabolic reactions (Boyer et al., Bioinformatics 2005 Dec 1;21(23):4209-15). Genes on the genome are linked through their functional annotations to reactions and compounds in the metabolic network; gene gaps and reaction gaps (including orphan reactions) appear inside metabolons. The method provides candidate genes for global/local orphan enzymatic activities that are located in the “gaps” of metabolons.
https://www.genoscope.cns.fr/agc/microscope/metabolism/canoe.php
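The metabolon idea can be illustrated with a toy sketch. This is not the actual CanOE algorithm, and all data structures are assumptions for illustration: an unannotated gene lying between genes whose reactions flank a one-reaction gap in a pathway is proposed as a candidate for that gap reaction.

```python
# Toy illustration of the metabolon concept (not the actual CanOE
# algorithm): an unannotated gene lying between genes whose reactions
# flank a one-reaction gap in a pathway is proposed as a candidate
# for the gap ("orphan") reaction.
def candidate_genes(gene_order, gene2rxn, pathway):
    """gene_order: genes in genome order; gene2rxn: known gene->reaction
    assignments; pathway: reactions in metabolic order.
    Returns {gap_reaction: [candidate genes]}."""
    candidates = {}
    for i, gene in enumerate(gene_order):
        if gene in gene2rxn:
            continue  # gene already has a reaction assigned
        if i == 0 or i + 1 == len(gene_order):
            continue  # need annotated neighbours on both sides
        left = gene2rxn.get(gene_order[i - 1])
        right = gene2rxn.get(gene_order[i + 1])
        if left in pathway and right in pathway:
            li, ri = pathway.index(left), pathway.index(right)
            if abs(li - ri) == 2:  # exactly one reaction missing in between
                gap = pathway[(li + ri) // 2]
                candidates.setdefault(gap, []).append(gene)
    return candidates
```

For instance, with gene order [b1, bX, b3], known assignments b1→R1 and b3→R3, and pathway R1→R2→R3, the unannotated gene bX becomes a candidate for the gap reaction R2.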
7. Curation of metabolic data in MicroScope
CanOE (Candidate genes for Orphan Enzymes) example: the allantoin degradation metabolon in E. coli K12. EC 2.1.3.5 is a global orphan reaction (not associated with any gene in any organism). Three candidate genes are proposed for the EC 2.1.3.5 reaction; none shares any significant similarity with known carbamoyltransferases. Protein expression and biochemical assays are under way.
Smith A.A.T., Belda E., Viari A., Médigue C., and Vallenet D., “The CanOE strategy: integrating genomic and metabolic contexts across multiple prokaryote genomes to find candidate genes for orphan enzymes” (PLoS Computational Biology, in revision).
8. Curation of metabolic data in MicroScope
GPR curation interface: in the context of network reconstruction, the definition of Gene-Protein-Reaction (GPR) associations is essential, i.e. which genes encode the enzymes, complexes, or isozymes catalyzing a particular metabolic reaction (Thiele & Palsson, Nat Protoc. 2010;5(1):93-121).
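A GPR association is essentially a Boolean rule over genes: "and" for the subunits of a complex (all required), "or" for isozymes (any one suffices). A minimal sketch, with an assumed nested-tuple rule format:

```python
# Minimal sketch of GPR (Gene-Protein-Reaction) rule evaluation.
# "and" models a complex (all subunit genes required); "or" models
# isozymes (any one gene suffices). The rule format is an assumption.
def reaction_catalyzed(gpr, present):
    """gpr: gene name or nested ("and"/"or", sub-rule, ...) tuple;
    present: set of genes available in the strain."""
    if isinstance(gpr, str):
        return gpr in present
    op, *subrules = gpr
    results = [reaction_catalyzed(sub, present) for sub in subrules]
    return all(results) if op == "and" else any(results)

# A reaction catalyzed either by a two-subunit complex A+B or by isozyme C:
rule = ("or", ("and", "geneA", "geneB"), "geneC")
```

With this rule, the strain {geneA, geneB} or {geneC} can catalyze the reaction, while {geneA} alone cannot.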
9. Curation of metabolic data in MicroScope
GPR curation interface: the gene curation interface of MicroScope allows the validation of Gene-Reaction associations based on curated gene annotations. Two reference reaction resources are available: MetaCyc (functional) and RHEA (under development). MetaCyc/Rhea reactions are retrieved automatically from EC number annotations (e.g. 4.1.3.27, 2.4.2.18) or found by keyword search.
10. Curation of metabolic data in MicroScope
Pathway validation interface: validation/curation of automatically projected MetaCyc pathways, based on Gene-Reaction associations.
11. Microme project: www.microme.eu
A Knowledge-Based Bioinformatics Framework for Microbial Pathway Genomics.
Purpose: develop bioinformatics infrastructures, together with a projection and curation process, in order to generate:
- complete metabolic pathways from genome annotations
- whole-cell metabolic models from pathway assemblies
Metabolic models are validated experimentally using growth phenotype data (e.g. BIOLOG experiments) generated within the project for a subset of selected species. Analytical tools are integrated for comparative and phylogenetic analysis based on projected pathways and metabolic models.
Partners: AMAbiotics, Centro Nacional de Biotecnología, CEA-Genoscope, European Bioinformatics Institute, Center for Research and Technology Hellas, German Collection of Microorganisms and Cell Cultures, ISTHMUS, Spanish National Cancer Centre, Molecular Networks, Tel-Aviv University, Université Libre de Bruxelles, Swiss Institute of Bioinformatics, Wageningen University, Wellcome Trust Sanger Institute.
12. Microme WP2: Objectives
Provide EU with a curated microbial metabolic resource
Implement a unique cyclic and colaborative curation process for metabolic data
Unification of existing metabolic resources:
Pivot resources: ChEBI (chemical compounds) and Rhea (chemical reactions)
Cross-references External resources (compounds, reactions, pathways):
KEGG, MetaCyc, Metabolic models
Alcantara R., Axelsen K.B., Morgat A., Belda E., Coudert E., Bridge A., Cao H.,
de Matos P., Ennis M., Turner S., Owen G., Bougueleret L., Xenarios I., and
Steinbeck C. (2012) Rhea: a manually curated resource of biochemical reactions.
Nucleic Acids Research 40 (Database issue), D754-D760.
MicroScope and Microme
Use MicroScope as the reference resource of curated GPR (Gene-Protein-Reaction)
associations for the microbial genomes included in the Microme project
Development of novel interfaces for GPR curation in the MicroScope environment:
retrieval of MetaCyc and Rhea reactions for a particular gene object from its EC-number
annotations
13. MicroScope and Microme
Development of web services to provide Microme partners with curated Gene-Reaction
associations from the MicroScope platform.
[Diagram: the curation tool writes to PkGDB; each night, pathway reconstruction
updates MicroCyc; results are exposed through web services.]
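A client for such a web service might look like the sketch below. The endpoint path, host, and JSON payload shape are hypothetical illustrations, not the actual Microme/MicroScope API.

```python
import json

# Sketch of a client for a MicroScope-style web service returning curated
# Gene-Reaction associations. URL and payload schema are invented for
# illustration only.
BASE_URL = "https://example.org/microscope/ws"  # hypothetical host

def association_url(genome_id):
    """Build the (hypothetical) query URL for one genome's GPR associations."""
    return f"{BASE_URL}/genomes/{genome_id}/gene-reactions"

def parse_associations(payload):
    """Turn a JSON payload into (gene, reaction, status) tuples."""
    return [(a["gene"], a["reaction"], a["status"])
            for a in json.loads(payload)["associations"]]

# Example payload a partner site might receive and parse.
sample = json.dumps({"associations": [
    {"gene": "BSU30190", "reaction": "RXN-09989", "status": "curated"},
]})
rows = parse_associations(sample)
```

Separating URL construction from payload parsing keeps the nightly-synchronised consumer side testable without a live server.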
14. Test-case: Bacillus subtilis 168 re-annotation
Second most intensively studied bacterium after Escherichia coli, and a model
organism for Gram-positive bacteria.
Genome sequenced in 1997: 4.2 Mb, ~4,000 CDSs.
Nature 1997 Nov 20;390(6657):249-56
Re-sequencing and first re-annotation of the genome in 2009.
Microbiology (2009), 155, 1758-1775
Re-annotation of the genome in the context of the Microme project, with special
focus on the curation of Gene-Reaction associations using the MicroScope metabolic
tools and curation interface. Collaborative work between LABGeM (CEA), SIB and
AMAbiotics (Antoine Danchin).
15. Test-case: Bacillus subtilis 168 re-annotation
Starting data for the curation of Gene-Reaction associations:
- 909 CDSs with a predicted MetaCyc reaction:
  - 531 CDSs with a BBH relationship with E. coli CDSs
  - 378 CDSs without a BBH relationship with E. coli CDSs
- 310 CDSs with "enzyme" in the product-type annotation and no predicted MetaCyc
reaction
- 508 CDSs with "putative enzyme" in the product-type annotation and no predicted
MetaCyc reaction
16. Test-case: Bacillus subtilis 168 re-annotation
From the 909 CDSs with a predicted reaction:
- 531 with a BBH in E. coli:
  - 416 with the same GPR in B. subtilis and E. coli (EcoCyc): automatic
validation of the Gene-Reaction associations
  - 115 with a different GPR in B. subtilis and E. coli (EcoCyc): manual
curation of the Gene-Reaction associations in the MicroScope environment,
using sequence similarity profiles and genomic context conservation
- 378 without a BBH in E. coli:
  - 254 with a GPR predicted from the curated EC number
  - 124 with a GPR predicted from the "product" annotation
310 CDSs with an "enzyme" annotation and no predicted reaction: integration of
genomic and metabolic context (CanOE strategy), exploiting co-evolution patterns
of functionally related genes.
508 CDSs with a "putative enzyme" annotation and no predicted reaction: filtered
by the Catalytic activity field of SwissProt annotations (41 CDSs retained).
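The automatic-validation rule for BBH pairs can be sketched as a set comparison: a B. subtilis gene whose predicted reaction set equals that of its E. coli bidirectional best hit is validated automatically, and mismatches are routed to manual curation. Gene IDs and reaction sets below are illustrative, not real curation data.

```python
# Triage sketch: auto-validate Gene-Reaction associations when the B. subtilis
# GPR matches the E. coli BBH partner's GPR (EcoCyc); otherwise flag for
# manual curation. All identifiers here are made up for illustration.
def triage(bsu_gpr, eco_gpr, bbh):
    auto, manual = [], []
    for bsu_gene, eco_gene in bbh.items():
        if bsu_gpr.get(bsu_gene, set()) == eco_gpr.get(eco_gene, set()):
            auto.append(bsu_gene)      # identical reaction sets: auto-validate
        else:
            manual.append(bsu_gene)    # divergent GPR: send to a curator
    return auto, manual

bsu = {"bsuA": {"RXN-1"}, "bsuB": {"RXN-2", "RXN-3"}}
eco = {"ecoA": {"RXN-1"}, "ecoB": {"RXN-2"}}
auto, manual = triage(bsu, eco, {"bsuA": "ecoA", "bsuB": "ecoB"})
```

On the real data this split gave 416 automatically validated CDSs and 115 sent to manual curation.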
17. Test-case: Bacillus subtilis 168 re-annotation
Problems associated with the automatic prediction of Gene-Reaction associations.
Example: a generic EC number definition associated with multiple specific
reaction instances in MetaCyc.
- 17 reactions predicted from the EC 1.2.1.3 annotation alone, with no
experimental evidence of activity and only a generic product annotation.
- This is problematic for modelling purposes: without experimental evidence of
the specific substrates, only the generic reaction has been validated.
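The validation policy described above can be stated in a few lines: expand the generic EC number to its instance reactions only when substrate evidence exists, otherwise keep just the generic reaction. Reaction identifiers below are illustrative placeholders.

```python
# Sketch of the curation policy for generic EC numbers: EC 1.2.1.3 (aldehyde
# dehydrogenase) maps to many specific MetaCyc instance reactions, but without
# substrate evidence only the generic reaction is validated. IDs are
# placeholders, not real MetaCyc identifiers.
def validated_reactions(instances, has_substrate_evidence):
    if has_substrate_evidence:
        return instances["specific"]       # keep the specific instance reactions
    return [instances["generic"]]          # fall back to the generic reaction

ec_1_2_1_3 = {
    "generic": "GENERIC-ALDEHYDE-DH-RXN",
    "specific": [f"SPECIFIC-RXN-{i}" for i in range(17)],  # 17 predicted instances
}
kept = validated_reactions(ec_1_2_1_3, has_substrate_evidence=False)
```

This is why the curated association count can drop even as annotation quality improves: many over-specific predictions collapse into one validated generic reaction.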
18. Test-case: Bacillus subtilis 168 re-annotation
Statistics of Gene-Reaction association curation in MicroScope (manually curated
counts in parentheses):

                                 Initial predictions   Current associations
                                 (Pathway Tools)       (manually curated)
  Nº reactions                   1022                  985 (388)
  Nº CDSs                        901                   1006 (517)
  Nº Gene-Reaction associations  1549                  1406 (715)

- 105 CDSs had no automatically predicted reaction in the initial projections
- 147 new reactions were added (not originally predicted)
- 184 originally predicted reactions were removed
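The counts above are internally consistent, which a quick arithmetic check makes explicit: additions and removals account for the change in reaction count, and the newly associated CDSs account for the change in CDS count.

```python
# Consistency check on the curation statistics: current totals should follow
# from the initial Pathway Tools predictions plus curation changes.
initial_reactions, added, removed = 1022, 147, 184
current_reactions = initial_reactions + added - removed   # expect 985

initial_cds, cds_without_initial_prediction = 901, 105
current_cds = initial_cds + cds_without_initial_prediction  # expect 1006
```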
19. Test-case: Bacillus subtilis 168 re-annotation
17 possible updates of SwissProt annotations and 6 possible new EC numbers,
reported to the SwissProt/IUBMB curators.
13 possible new metabolic pathways/pathway variants not present in MetaCyc:
- New pathway variants: biotin biosynthesis, lipoate biosynthesis, myo-inositol
catabolism, rhamnogalacturonan type I degradation, acetoin dehydrogenase,
methionine salvage, aerobic respiration
- New metabolic pathways: bacillaene biosynthesis, aromatic polyketide
biosynthesis, 2-methylthio-N6-threonylcarbamoyladenosine biosynthesis,
bacilysocin biosynthesis, archaeal-type ether lipid biosynthesis,
methionine-cysteine interconversion
20. Test-case: Bacillus subtilis 168 re-annotation
Biotin biosynthesis pathway variant: update of the DAPA aminotransferase step
(EC 2.6.1.62).
In the KEGG (map00780) and MetaCyc (PWY-5005) pathways, S-adenosyl-L-methionine
is the amino-group donor; in the Bacillus subtilis BioA enzyme, L-lysine is used
instead of S-adenosyl-methionine as the amino-group donor.
21. Test-case: Bacillus subtilis 168 re-annotation
Biotin biosynthesis pathway variant: link with fatty acid metabolism, and
improvement of genome-scale metabolic models.
iBsu1103: the most up-to-date B. subtilis 168 metabolic model (SEED methodology;
1437 reactions, 1103 genes). Henry CS, Zinner JF, Cohoon MP, Stevens RL.
Genome Biol. 2009;10(6):R69
In the model, pimelate (EX_pimelate) is a dead-end metabolite, making the
biotin biosynthesis pathway auxotrophic, and biotin (EX_biotin) is not included
in the biomass equation.
FBA simulations with the iBsu1103 model (biomass production rate):
- iBsu1103: 122.97
- iBsu1103, biotin in the biomass equation: 0.00
- iBsu1103, external pimelate influx: 122.97
- iBsu1103, external biotin influx: 122.97
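The qualitative logic behind these FBA results can be shown with a toy producibility check: if biotin is required by biomass while its precursor pimelate is a dead end, no biomass is made; supplying pimelate (or biotin) externally restores growth. This is a reachability sketch, not the iBsu1103 model or a real FBA solver.

```python
# Toy metabolite-reachability check illustrating the biotin auxotrophy result.
# Reactions are (substrates, products) pairs; the network below is a two-step
# caricature, not the real iBsu1103 stoichiometry.
def producible(targets, reactions, seeds):
    """True if all target metabolites are reachable from the seed nutrients."""
    have = set(seeds)
    changed = True
    while changed:
        changed = False
        for subs, prods in reactions:
            if set(subs) <= have and not set(prods) <= have:
                have |= set(prods)
                changed = True
    return all(t in have for t in targets)

# pimelate is a dead end: no reaction in the network produces it
network = [(["pimelate"], ["biotin"]), (["glucose"], ["precursors"])]

grows_without_biotin = producible(["precursors"], network, {"glucose"})
grows_biotin_in_biomass = producible(["precursors", "biotin"], network, {"glucose"})
grows_with_pimelate_influx = producible(["precursors", "biotin"], network,
                                        {"glucose", "pimelate"})
```

Adding the newly curated BioI reaction (fatty acids to pimeloyl-ACP) plays the same role as the pimelate influx here: it connects the dead-end precursor to the rest of metabolism.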
22. Test-case: Bacillus subtilis 168 re-annotation
The BioI enzyme of B. subtilis 168 (BSU30190) is a cytochrome P450 protein that
catalyzes the oxidative cleavage of acyl-ACP / free fatty acid molecules
generated during fatty acid biosynthesis, yielding pimeloyl-ACP as the primary
product.
[Diagram: fatty acid metabolism provides an acyl-ACP (or a free fatty acid),
which BioI (BSU30190) cleaves to pimeloyl-ACP; BioF (BSU30220) then condenses
pimeloyl-ACP with L-alanine + H+, releasing CO2 + holo-ACP.]
23. Future work
Extension of the reference set of Microme species to:
- Acinetobacter sp. ADP1
- Pseudomonas putida KT2440
- Bacillus subtilis 168
Second version of the Gene-Reaction curation interface in the MicroScope
environment:
- Curation of protein complexes / isozyme sets
- Management of Rhea reactions in addition to MetaCyc reactions
Definition of strategies for vertical annotation and propagation of curated
GPRs across multiple microbial genomes.
Use of UniPathway as the reference resource of metabolic pathways in MicroScope;
species-specific pathway representations based on the combination of pathway
modules (http://www.unipathway.org).
24. Contributions
Claudine Médigue (Group Leader)
David Vallenet (Researcher)
Damien Monrico (Engineer)
François Lefèvre (Engineer)
Alexander T. Smith (PhD)
Eugeni Belda (Post doc)
IT team: Claude Scarpelli, Ludovic Fleury
External partners: Anne Morgat, Antoine Danchin
Funding
EU Framework Programme 7 Collaborative Project. Grant Agreement Number 222886-2