These are the slides from the METASPACE Training Course given at OurCon'16.
METASPACE is a European Horizon2020 project on Bioinformatics for Spatial Metabolomics. Specifically, it aims at developing an engine for metabolite annotation of HR imaging mass spectrometry data. The project was funded in the Personalizing Health and Care program for 3 years (2015-2018) and is coordinated by the European Molecular Biology Laboratory.
The slides are organized into two parts.
Part 1: Introduction to the project, the bioinformatics behind it, and the online engine.
Part 2: Step-by-step tutorial on how to use the online engine for annotating metabolites from HR imaging mass spectrometry data.
For more information on METASPACE, please visit the project website http://metaspace2020.eu, twitter @metaspace2020, or email us at contact@metaspace2020.eu.
ARCHIVED: new version available - METASPACE Step by Step guide (METASPACE)
These slides provide a guide to using the METASPACE platform for annotating metabolites in high-resolving-power imaging mass spectrometry datasets. They describe:
* the science behind molecular annotation
* how to use our web application to upload, browse, interpret and export annotations from the platform.
These slides present the bioinformatics for metabolite annotation of HR imaging MS. The bioinformatics was developed in the framework of the METASPACE project.
METASPACE is a European Horizon2020 project on Bioinformatics for Spatial Metabolomics. Specifically, it aims at developing an engine for metabolite annotation of HR imaging mass spectrometry data. The project was funded in the Personalizing Health and Care program for 3 years (2015-2018) and is coordinated by the European Molecular Biology Laboratory.
The presentation was given at the METASPACE Training Course at OurCon'17 on 25.10.2017.
For more information on METASPACE, please visit the project website http://metaspace2020.eu, twitter @metaspace2020, or email us at contact@metaspace2020.eu.
The document discusses the ArrayExpress team's experience with MAGE-TAB, a tab-delimited format for representing microarray data. Key points include:
- ArrayExpress has integrated MAGE-TAB into its data acquisition and plans to convert all existing data to MAGE-TAB.
- MAGE-TAB allows for more efficient curation and user updates compared to previous formats.
- ArrayExpress is working to extend MAGE-TAB to represent additional data types beyond microarrays and developing ontologies to support MAGE-TAB.
- Tools and validation procedures for working with MAGE-TAB are being made publicly available.
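Since MAGE-TAB's sample-and-data-relationship (SDRF) component is a plain tab-delimited table, it can be read with ordinary TSV tooling. A minimal Python sketch, where the column names and rows are illustrative rather than an official MAGE-TAB header set:

```python
import csv
import io

# A toy tab-delimited SDRF-style table. Column names are illustrative,
# not an exact MAGE-TAB header set.
sdrf_text = (
    "Source Name\tCharacteristics[organism]\tAssay Name\n"
    "sample1\tHomo sapiens\tassay1\n"
    "sample2\tMus musculus\tassay2\n"
)

def read_sdrf(text):
    """Parse a tab-delimited table into a list of per-row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

rows = read_sdrf(sdrf_text)
print(rows[0]["Source Name"])   # sample1
print(len(rows))                # 2
```

Because each row becomes a dict keyed by the header, curation scripts can check or update individual annotation columns without any format-specific parser.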
1. Materials Informatics uses Python tools like RDKit for analyzing molecular structures and properties.
2. ORGAN and MolGAN are two generative models that use GANs to generate novel molecular structures based on SMILES strings, with ORGAN incorporating reinforcement learning to optimize for desired properties.
3. Tools like RDKit enable analyzing molecular fingerprints and descriptors that can be used for machine learning applications in materials informatics.
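Fingerprints of the kind RDKit produces are typically compared with Tanimoto similarity before being fed to ML models. As a library-free illustration, here is Tanimoto similarity over fingerprints represented as sets of "on" bit positions; the bit positions below are invented, not real RDKit output:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each represented as a set of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Made-up fingerprints for two hypothetical molecules.
mol1 = {1, 4, 9, 16, 25}
mol2 = {1, 4, 9, 36, 49}
print(tanimoto(mol1, mol2))  # 3 shared bits / 7 total bits = 0.428...
```

The same pairwise similarity matrix is what clustering or nearest-neighbour models in materials informatics typically consume.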
The US-EPA Chemicals Dashboard is an integrated data hub providing centralized access to environmental chemistry data to support EPA and partner decision making. It contains nearly 900,000 chemical substances with experimental and predicted physical/chemical properties, hazard, exposure, and toxicity data. Users can perform various search types including basic searches by chemical name/identifier, structure-based searches, and batch searches for lists of chemicals. Detailed chemical pages display curated data from sources like ToxCast/Tox21 along with linked analogous chemicals and real-time predictive models. The goal is to improve efficiency in chemical risk assessment through easy access to this centralized chemical data resource.
OpenML is a platform that aims to organize machine learning data, experiments, and models. It provides APIs and tools to allow users to easily find and reuse datasets, run algorithms on tasks to generate model evaluations, and share results. All experiments are automatically logged and linked, enabling comparisons to other results and improving reproducibility. The goal is for OpenML to enhance ML research by removing friction and facilitating collaboration through its organized resources and network effects.
The document discusses how the EPA's CompTox Chemicals Dashboard can be used to support mass spectrometry analyses for structure identification. The Dashboard contains data on over 800,000 chemicals including properties, lists, and links to other resources. It allows searching by formula, structure, and mass to find related chemicals. Candidate structures can be ranked using metadata. Predicted mass spectra from over 800,000 structures may also be accessible. The Dashboard integrates data to help identify unknown chemicals detected by mass spectrometry.
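Mass-based candidate searching of the kind the Dashboard offers boils down to filtering a formula list by a parts-per-million mass window. A minimal sketch with a made-up three-compound candidate list (the masses are rounded textbook monoisotopic values, not Dashboard data):

```python
# Monoisotopic masses of a few example formulas (rounded; illustrative only).
candidates = {
    "C8H10N4O2 (caffeine)": 194.0804,
    "C9H8O4 (aspirin)":     180.0423,
    "C6H12O6 (glucose)":    180.0634,
}

def match_mass(measured, candidates, ppm=10.0):
    """Return candidates whose monoisotopic mass lies within
    `ppm` parts-per-million of the measured mass."""
    hits = []
    for name, mass in candidates.items():
        error_ppm = abs(measured - mass) / mass * 1e6
        if error_ppm <= ppm:
            hits.append((name, round(error_ppm, 2)))
    return hits

print(match_mass(180.0633, candidates))  # glucose matches; aspirin is ~117 ppm off
```

Note how a tight ppm window separates the two isobaric candidates near 180 Da, which is exactly why high mass accuracy narrows candidate lists so effectively.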
The phrase “Big Data” is generally used to describe a large volume of structured and/or unstructured data that cannot be processed using traditional database and software techniques. In the domain of chemistry, the Royal Society of Chemistry certainly hosts large structured databases of chemistry data, for example ChemSpider, as well as unstructured content in the form of our collection of scientific articles. Our research literature provides value to its readership and, at present, as an example of one of our databases, ChemSpider is accessed by many tens of thousands of scientists every day. But do these collections constitute “Big Data”, or is it the potential that lies within the collections that can contribute to the Big Data movement? This presentation will discuss our activities to contribute both data, and service-based access to our data sets, to support grant-based projects such as the Innovative Medicines Initiative Open PHACTS project (to support drug discovery) and the PharmaSea initiative (to identify novel natural products from the ocean). We will also provide an overview of our activities to perform data mining of public patent collections and examine what can be done with the data. We are presently extracting physicochemical properties and textual forms of NMR spectra and, with the resulting data, are building predictive models (for melting points at present) and assembling a large NMR spectral database containing many hundreds of thousands of spectrum-structure pairs. Our experiences to date have demonstrated that we are working at the edge of current algorithmic and computing capabilities for predictive model building, with over a quarter of a million melting points producing a matrix of over 200 billion descriptors. Our work to produce the NMR spectral database will necessitate batch processing of the data to examine consistency between the spectrum-structure pairs and other forms of data validation.
The intention is to take our experiences in this work applied to a public patents corpus and apply it to the RSC back file of publications to mine data and enable new paths to the discoverability of both data and the associated publications.
The document provides an overview of a METASPACE training guide on metabolite annotation in high-resolution imaging mass spectrometry data. It covers three parts: introduction to the METASPACE platform and annotation process, a tutorial on using the annotation engine and knowledgebase, and exporting data to the required imzML format from different mass spectrometers. The tutorial teaches participants how to prepare and submit data, browse results, and interpret annotations from the METASPACE bioinformatics tools in order to annotate metabolites in imaging MS data.
ProFET - Protein Feature Engineering Toolkit (Dan Ofer)
Summary of the ProFET project.
This is a newly developed toolkit for end-to-end machine learning and feature extraction from proteins.
The Code can be freely downloaded here:
https://github.com/ddofer/ProFET
Integrative information management for systems biology (Neil Swainston)
The MCISB develops experimental and computational technologies in systems biology. It employs 9.5 multidisciplinary people to develop kinetic models of yeast metabolism using genome-scale SBML models annotated with MIRIAM standards. The modeling process involves identifying pathways to model, associating models with functions and parameter values, and analyzing/simulating resulting models.
The document discusses various research projects involving the automated design and optimization of complex physical, chemical, and biological systems using evolutionary algorithms and machine learning techniques. It describes current and planned usage of computer clusters to run simulations and experiments for protein structure prediction, software self-assembly, and modeling physico-chemical systems through evolutionary optimization of parameters. The research requires significant computational resources to process large datasets and evaluate models in parallel.
Informatics In The Manchester Centre For Integrative Systems Biology (Neil Swainston)
The MCISB employs 9.5 multidisciplinary people who share an office and lab. They follow an iterative and integrative approach to develop an annotated, kinetic model of yeast metabolism. To integrate experimental data with models, they utilize unique, public identifiers for molecules from databases like ChEBI and UniProt. They have developed tools like KineticsWizard to help experimentalists capture identifier data. Their annotated yeast model follows MIRIAM standards to unambiguously identify over 2000 molecules with database references.
This document summarizes a presentation about OpenML, an online platform for sharing machine learning data and experiments. OpenML allows users to search datasets, build machine learning models using various tools/APIs, run experiments on tasks, and automatically upload results. This facilitates reproducibility, benchmarking, and reuse of prior work. OpenML also aims to advance automated machine learning through meta-learning techniques that leverage the large amount of shared data and experiments.
The document discusses metagenomics analysis tools and challenges. It summarizes several metagenome analysis portals that provide computational analysis and public sample databases. It also discusses the rapid growth of metagenomic data being produced, challenges around quality control, feature identification, characterization and presentation of metagenomic data, and the need for standardized metadata and data formats. The future directions highlighted include studying strain variation, expanding metadata capture and standards, and developing improved assembly, binning and analysis methods.
The document describes a fully automated system called AutoChrom for chromatographic method development using LC/MS/DAD detection. AutoChrom aims to streamline the method development workflow by automating instrument control, interpreting data, managing information, and contributing to workflow. It utilizes several techniques including mutual automated peak matching of UV and MS data, composite chromatograms of multiple samples, and concise reporting to organize and communicate results from complex method development projects involving large amounts of hyphenated data.
This document discusses genomic meta-analysis and summarization techniques. It introduces MetaQC for quality control, MetaDE for detecting differentially expressed genes through meta-analysis, and MetaPCA for integrative visualization of multiple genomic studies. MetaQC uses quality measures to determine inclusion/exclusion of studies in meta-analysis. MetaDE detects biomarkers statistically significant across studies using Fisher's and adaptive weighting methods. MetaPCA integrates multiple genomic datasets by finding a common principal component space.
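Fisher's method, one of the combination rules mentioned for MetaDE, pools per-study p-values into a single statistic, X^2 = -2 * sum(ln p_i), which under the null follows a chi-square distribution with 2k degrees of freedom for k studies. A minimal sketch computing the statistic (the tail probability, which needs a chi-square CDF, is omitted; the p-values are invented):

```python
import math

def fisher_statistic(p_values):
    """Fisher's combined test statistic: X^2 = -2 * sum(ln p_i).
    Under the null it follows a chi-square distribution with
    2 * len(p_values) degrees of freedom."""
    return -2.0 * sum(math.log(p) for p in p_values)

# Three hypothetical studies reporting p-values for the same gene.
p_values = [0.01, 0.04, 0.20]
stat = fisher_statistic(p_values)
df = 2 * len(p_values)
print(round(stat, 2), df)  # 18.87 6
```

A gene is then flagged as a cross-study biomarker when this combined statistic exceeds the chosen chi-square critical value.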
Standarization in Proteomics: From raw data to metadata files (Yasset Perez-Riverol)
The document discusses various mass spectrometry file formats used in proteomics workflows, including the advantages of XML-based formats like mzML and mzIdentML that support metadata and can be read by different software. It also describes challenges with proprietary binary formats and efforts to develop common data standards and APIs through projects like ProteoWizard, PRIDE, and the ms-core-api library. Standard file formats are important for sharing and reusing proteomics data over time as instrumentation and software evolve.
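mzML records spectrum metadata as controlled-vocabulary cvParam elements inside XML, which is why generic XML tooling can read it. A toy sketch extracting "ms level" values from an abridged mzML-style fragment (real files add a namespace, an index wrapper, and base64-encoded peak arrays, all omitted here):

```python
import xml.etree.ElementTree as ET

# An abridged, mzML-style fragment -- not a complete or valid mzML document.
xml_text = """
<spectrumList count="2">
  <spectrum id="scan=1">
    <cvParam accession="MS:1000511" name="ms level" value="1"/>
  </spectrum>
  <spectrum id="scan=2">
    <cvParam accession="MS:1000511" name="ms level" value="2"/>
  </spectrum>
</spectrumList>
"""

root = ET.fromstring(xml_text)
levels = {
    spec.get("id"): param.get("value")
    for spec in root.iter("spectrum")
    for param in spec.iter("cvParam")
    if param.get("name") == "ms level"
}
print(levels)  # {'scan=1': '1', 'scan=2': '2'}
```

The controlled-vocabulary accession numbers are what lets different software agree on the meaning of each field, which is the core argument for these open formats over proprietary binaries.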
Use of spark for proteomic scoring seattle presentation (lordjoe)
This document discusses using Apache Spark to parallelize proteomic scoring, which involves matching tandem mass spectra against a large database of peptides. The author developed a version of the Comet scoring algorithm and implemented it on a Spark cluster. This outperformed single machines by over 10x, allowing searches that took 8 hours to be done in under 30 minutes. Key considerations for running large jobs in parallel on Spark are discussed, such as input formatting, accumulator functions for debugging, and smart partitioning of data. The performance improvements allow searching larger databases and considering more modifications.
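The "smart partitioning" point can be sketched without Spark: grouping spectra by precursor-mass bin means each partition only needs to be scored against the matching slice of the peptide database. The bin width and masses below are arbitrary illustration values, not from the talk:

```python
from collections import defaultdict

def partition_by_mass(spectra, bin_width=50.0):
    """Group spectra into partitions keyed by precursor-mass bin, so each
    partition can be scored against only the matching database slice."""
    partitions = defaultdict(list)
    for spec_id, precursor_mass in spectra:
        partitions[int(precursor_mass // bin_width)].append(spec_id)
    return dict(partitions)

# (spectrum id, precursor mass) pairs -- invented values.
spectra = [("s1", 820.4), ("s2", 845.1), ("s3", 901.7), ("s4", 1204.9)]
print(partition_by_mass(spectra))
# {16: ['s1', 's2'], 18: ['s3'], 24: ['s4']}
```

In a real Spark job the same key would drive a `partitionBy`-style shuffle, keeping each executor's working set small enough to hold its database slice in memory.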
The document describes a software system being developed to visually monitor the workload of cores in a high-performance manycore computer architecture. The system receives data about the state of cores in a computing system, analyzes the data, and displays it visually with remote web access. Compared to other software for visually monitoring multiprocessor systems, this system provides a visual display of processed data on the state of cores based on analysis of inter-core messages and characteristics of individual cores. The system is being developed using Microsoft Visual Studio 2008 on a 16-core Windows cluster at Polytechnic University and will aid in analyzing and monitoring complex systems and their components during different workload modes.
1) Data analytics involves treating available digital data as a "gold mine" to obtain tangible outputs that can improve business efficiency when applied. Machine learning uses algorithms to correlate parameters in data and improve relationships.
2) The document provides an overview of getting started in data science, covering business objectives, statistical analysis, programming tools like R and Python, and problem-solving approaches like supervised and unsupervised learning.
3) It describes the iterative "rule of seven" process for data science projects, including collecting/preparing data, exploring/analyzing it, transforming features, applying models, evaluating performance, and visualizing results.
Metabolomic Data Analysis Workshop and Tutorials (2014) (Dmitry Grapov)
This document provides an introduction and overview of tutorials for metabolomic data analysis. It discusses downloading required files and software. The goals of the analysis include using statistical and multivariate analyses to identify differences between sample groups and impacted biochemical domains. It also discusses various data analysis techniques including data quality assessment, univariate and multivariate statistical analyses, clustering, principal component analysis, partial least squares modeling, functional enrichment analysis, and network mapping.
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ... (Spark Summit)
Spark 2.0 provided strong performance enhancements to the Spark core while advancing Spark ML usability through data frames. But what happens when you run Spark 2.0 machine learning algorithms on a large cluster with a very large data set? Do you even get any benefit from using a very large data set? It depends. How do new hardware advances affect the topology of high-performance Spark clusters? In this talk we will explore Spark 2.0 Machine Learning at scale and share our findings with the community.
As our test platform we will be using a new cluster design, different from typical Hadoop clusters, with more cores, more RAM, latest-generation NVMe SSDs and a 100GbE network, with the goal of more performance in a more space- and energy-efficient footprint.
Automating Machine Learning - Is it feasible? (Manuel Martín)
Facing a machine learning problem for the first time can be overwhelming. Hundreds of methods exist for tackling problems such as classification, regression or clustering. Selecting the appropriate method is challenging, especially if little prior knowledge is available. In addition, most models require optimising a number of hyperparameters to perform well. Preparing the data for the learning algorithm is also a labour-intensive process that includes cleaning outliers and imperfections, feature selection, and data transformations such as PCA. A workflow connecting preprocessing methods and predictive models is called a multicomponent predictive system (MCPS). This talk introduces the problem of automating the composition and optimisation of MCPSs, and also how they can be adapted in changing environments.
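The MCPS idea can be sketched by composing preprocessing steps and a final transform into one callable chain. The particular steps below, outlier clipping and min-max scaling, are illustrative choices, not components from the talk:

```python
def make_pipeline(*steps):
    """Compose steps into a single callable MCPS.
    Each step maps a list of numbers to a list of numbers."""
    def run(data):
        for step in steps:
            data = step(data)
        return data
    return run

# Illustrative components: clip outliers, then scale to [0, 1].
def clip_outliers(xs, lo=0.0, hi=100.0):
    return [min(max(x, lo), hi) for x in xs]

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

pipeline = make_pipeline(clip_outliers, min_max_scale)
print(pipeline([-5.0, 20.0, 60.0, 250.0]))  # [0.0, 0.2, 0.6, 1.0]
```

Automating MCPS composition then amounts to searching over which steps to include, in what order, and with what hyperparameters (here, the clipping bounds).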
A good foundation has been established for both data mining research and genuine application-based data mining. The current functionality of EMADS is limited to classification and Meta-ARM. The research team is at present working towards increasing the diversity of mining tasks that EMADS can address. There are many directions in which the work can be (and is being) taken forward. One interesting direction is to build on the wealth of distributed data mining research that is currently available and progress this in an MAS context. The research team is also enhancing the system's robustness so as to make it publicly available. It is hoped that once the system is live, other interested data mining practitioners will be prepared to contribute algorithms and data.
Fostering Serendipity through Big Linked Data (Muhammad Saleem)
This document discusses fostering serendipity through linking large biomedical datasets. It linked over 30 billion triples from The Cancer Genome Atlas (TCGA) and over 23 million publications from PubMed. It developed an architecture called TopFed to continuously integrate new data through parallel querying. TopFed was evaluated against the FedX system and shown to have significantly better performance, with query runtimes over 75 times faster for some queries. A visualization interface was also created to explore the linked data.
The document discusses various methods and challenges for identifying compounds based on limited information such as mass, name, fingerprint, or spectral data. It describes searching public databases, calculating elemental compositions, comparing spectra, and predicting fragmentation patterns to identify molecules or narrow down candidates. Even with this information, unique identification can be difficult, and integration of additional data types may be needed.
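Calculating candidate elemental compositions from an accurate mass can be brute-forced for small molecules. A toy CHNO enumerator, where the element ranges and the 0.01 Da tolerance are arbitrary choices for illustration:

```python
from itertools import product

# Monoisotopic masses of common elements (values to 4 decimal places).
MASS = {"C": 12.0000, "H": 1.0078, "N": 14.0031, "O": 15.9949}

def compositions(target, tol=0.01, max_counts=(12, 24, 6, 6)):
    """Enumerate CHNO formulas whose monoisotopic mass falls within
    `tol` Da of `target`. Element ranges kept tiny to stay fast."""
    hits = []
    for c, h, n, o in product(*(range(m + 1) for m in max_counts)):
        mass = c * MASS["C"] + h * MASS["H"] + n * MASS["N"] + o * MASS["O"]
        if abs(mass - target) <= tol:
            hits.append(f"C{c}H{h}N{n}O{o}")
    return hits

print(compositions(46.0419))
# ['C0H4N3O0', 'C2H6N0O1'] -- the latter is ethanol
```

Even this tiny example yields two candidates for one mass, which illustrates why composition alone rarely gives a unique identification and why spectral or fragmentation evidence is usually needed as well.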
This document describes the design and implementation of an integrated system called MPDB for the storage and analysis of metabolomics data. MPDB was created as a free open-source laboratory information system tailored for the metabolomics workflow. It includes tools for raw data cleanup, compound identification, peak alignment across samples, data normalization, and statistical analysis. The system pipeline allows users to efficiently store large amounts of analytical results and associated biological metadata, perform multi-sample analysis and data mining, and gain new biological insights from metabolomics experiments. As an example application, the document outlines a study analyzing the effects of nitrogen stress on the leaf metabolism of Populus trees using MPDB.
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation and commensalism, or negative, such as parasitism, predation or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each organism in the interaction benefits from the association. It is an obligatory relationship in which mutualist and host are metabolically dependent on each other.
A mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
A mutualistic relationship allows organisms to exist in habitats that could not be occupied by either species alone.
A mutualistic relationship also allows the partners to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are excellent example of mutualism.
They are an association of specific fungi with certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism, both organisms in the association benefit.
Compound A → (utilized by population 1) → Compound B → (utilized by population 2) → Compound C → (utilized by both populations 1+2) → Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the co-operation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Together, populations 1 and 2 are able to carry out a metabolic reaction leading to an end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from other, fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates; these are then utilized by methanogenic bacteria (e.g. Methanobacter) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal medium, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
A synergistic relationship occurs in which E. faecalis requires folic acid, which is produced by L. arabinosus, while L. arabinosus in turn requires phenylalanine, which is produced by E. faecalis.
The document describes a software system being developed to visually monitor the workload of cores in a high-performance manycore computer architecture. The system receives data about the state of cores in a computing system, analyzes the data, and displays it visually with remote web access. Compared to other software for visually monitoring multiprocessor systems, this system provides a visual display of processed data on the state of cores based on analysis of inter-core messages and characteristics of individual cores. The system is being developed using Microsoft Visual Studio 2008 on a 16-core Windows cluster at Polytechnic University and will aid in analyzing and monitoring complex systems and their components during different workload modes.
1) Data analytics involves treating available digital data as a "gold mine" to obtain tangible outputs that can improve business efficiency when applied. Machine learning uses algorithms to correlate parameters in data and improve relationships.
2) The document provides an overview of getting started in data science, covering business objectives, statistical analysis, programming tools like R and Python, and problem-solving approaches like supervised and unsupervised learning.
3) It describes the iterative "rule of seven" process for data science projects, including collecting/preparing data, exploring/analyzing it, transforming features, applying models, evaluating performance, and visualizing results.
Metabolomic Data Analysis Workshop and Tutorials (2014)Dmitry Grapov
This document provides an introduction and overview of tutorials for metabolomic data analysis. It discusses downloading required files and software. The goals of the analysis include using statistical and multivariate analyses to identify differences between sample groups and impacted biochemical domains. It also discusses various data analysis techniques including data quality assessment, univariate and multivariate statistical analyses, clustering, principal component analysis, partial least squares modeling, functional enrichment analysis, and network mapping.
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
Spark 2.0 provided strong performance enhancements to the Spark core while advancing Spark ML usability to use data frames. But what happens when you run Spark 2.0 machine learning algorithms on a large cluster with a very large data set? Do you even get any benefit from using a very large data set? It depends. How do new hardware advances affect the topology of high performance Spark clusters. In this talk we will explore Spark 2.0 Machine Learning at scale and share our findings with the community.
As our test platform we will be using a new cluster design, different from typical Hadoop clusters, with more cores, more RAM and latest generation NVMe SSD’s and a 100GbE network with a goal of more performance, in a more space and energy efficient footprint.
Automating Machine Learning - Is it feasible?Manuel Martín
Facing a machine learning problem for the first time can be overwhelming. Hundreds of methods exist for tackling problems such as classification, regression or clustering. Selecting the appropriate method is challenging, specially if no much prior knowledge is known. In addition, most models require to optimise a number of hyperparameters to perform well. Preparing the data for the learning algorithm is also a labour-intensive process that includes cleaning outliers and imperfections, feature selection, data transformation like PCA and more. A workflow connecting preprocessing methods and predictive models is called a multicomponent predictive system (MCPS). This talk introduces the problem of automating the composition and optimisation of MCPSs and also how they can be adapted in changing environments.
A good foundation has been established for both data mining research and genuine
application based data mining. The current functionality of EMADS is limited
to classification and Meta-ARM. The research team is at present working towards
increasing the diversity of mining tasks that EMADS can address. There are many
directions in which the work can (and is being) taken forward. One interesting direction
is to build on the wealth of distributed data mining research that is currently
available and progress this in an MAS context. The research team are also enhancing
the system’s robustness so as to make it publicly available. It is hoped that once
the system is live other interested data mining practitioners will be prepared to contribute
algorithms and data.
Fostering Serendipity through Big Linked DataMuhammad Saleem
This document discusses fostering serendipity through linking large biomedical datasets. It linked over 30 billion triples from The Cancer Genome Atlas (TCGA) and over 23 million publications from PubMed. It developed an architecture called TopFed to continuously integrate new data through parallel querying. TopFed was evaluated against the FedX system and shown to have significantly better performance, with query runtimes over 75 times faster for some queries. A visualization interface was also created to explore the linked data.
The document discusses various methods and challenges for identifying compounds based on limited information such as mass, name, fingerprint, or spectral data. It describes searching public databases, calculating elemental compositions, comparing spectra, and predicting fragmentation patterns to identify molecules or narrow down candidates. Even with this information, unique identification can be difficult, and integration of additional data types may be needed.
This document describes the design and implementation of an integrated system called MPDB for the storage and analysis of metabolomics data. MPDB was created as a free open-source laboratory information system tailored for the metabolomics workflow. It includes tools for raw data cleanup, compound identification, peak alignment across samples, data normalization, and statistical analysis. The system pipeline allows users to efficiently store large amounts of analytical results and associated biological metadata, perform multi-sample analysis and data mining, and gain new biological insights from metabolomics experiments. As an example application, the document outlines a study analyzing the effects of nitrogen stress on the leaf metabolism of Populus trees using MPDB.
Similar to ARCHIVED: new version available. 2016 - METASPACE Training Course (20)
Microbial interaction
Microorganisms interacts with each other and can be physically associated with another organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont or located within another organism as endobiont.
Microbial interaction may be positive such as mutualism, proto-cooperation, commensalism or may be negative such as parasitism, predation or competition
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as the relationship in which each organism in interaction gets benefits from association. It is an obligatory relationship in which mutualist and host are metabolically dependent on each other.
Mutualistic relationship is very specific where one member of association cannot be replaced by another species.
Mutualism require close physical contact between interacting organisms.
Relationship of mutualism allows organisms to exist in habitat that could not occupied by either species alone.
Mutualistic relationship between organisms allows them to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are excellent example of mutualism.
They are the association of specific fungi and certain genus of algae. In lichen, fungal partner is called mycobiont and algal partner is called
II. Syntrophism:
It is an association in which the growth of one organism either depends on or improved by the substrate provided by another organism.
In syntrophism both organism in association gets benefits.
Compound A
Utilized by population 1
Compound B
Utilized by population 2
Compound C
utilized by both Population 1+2
Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B but cannot metabolize beyond compound B without co-operation of population 2. Population 2is unable to utilize compound A but it can metabolize compound B forming compound C. Then both population 1 and 2 are able to carry out metabolic reaction which leads to formation of end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane produced by methanogenic bacteria depends upon interspecies hydrogen transfer by other fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 utilizing carbohydrates which is then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arobinosus and Enterococcus faecalis:
In the minimal media, Lactobacillus arobinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship between E. faecalis and L. arobinosus occurs in which E. faecalis require folic acid
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills MN
By harnessing the power of High Flux Vacuum Membrane Distillation, Travis Hills from MN envisions a future where clean and safe drinking water is accessible to all, regardless of geographical location or economic status.
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...Sérgio Sacani
Wereport the study of a huge optical intraday flare on 2021 November 12 at 2 a.m. UT in the blazar OJ287. In the binary black hole model, it is associated with an impact of the secondary black hole on the accretion disk of the primary. Our multifrequency observing campaign was set up to search for such a signature of the impact based on a prediction made 8 yr earlier. The first I-band results of the flare have already been reported by Kishore et al. (2024). Here we combine these data with our monitoring in the R-band. There is a big change in the R–I spectral index by 1.0 ±0.1 between the normal background and the flare, suggesting a new component of radiation. The polarization variation during the rise of the flare suggests the same. The limits on the source size place it most reasonably in the jet of the secondary BH. We then ask why we have not seen this phenomenon before. We show that OJ287 was never before observed with sufficient sensitivity on the night when the flare should have happened according to the binary model. We also study the probability that this flare is just an oversized example of intraday variability using the Krakow data set of intense monitoring between 2015 and 2023. We find that the occurrence of a flare of this size and rapidity is unlikely. In machine-readable Tables 1 and 2, we give the full orbit-linked historical light curve of OJ287 as well as the dense monitoring sample of Krakow.
PPT on Direct Seeded Rice presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
Immersive Learning That Works: Research Grounding and Paths ForwardLeonel Morgado
We will metaverse into the essence of immersive learning, into its three dimensions and conceptual models. This approach encompasses elements from teaching methodologies to social involvement, through organizational concerns and technologies. Challenging the perception of learning as knowledge transfer, we introduce a 'Uses, Practices & Strategies' model operationalized by the 'Immersive Learning Brain' and ‘Immersion Cube’ frameworks. This approach offers a comprehensive guide through the intricacies of immersive educational experiences and spotlighting research frontiers, along the immersion dimensions of system, narrative, and agency. Our discourse extends to stakeholders beyond the academic sphere, addressing the interests of technologists, instructional designers, and policymakers. We span various contexts, from formal education to organizational transformation to the new horizon of an AI-pervasive society. This keynote aims to unite the iLRN community in a collaborative journey towards a future where immersive learning research and practice coalesce, paving the way for innovative educational research and practice landscapes.
Anti-Universe And Emergent Gravity and the Dark UniverseSérgio Sacani
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
When I was asked to give a companion lecture in support of ‘The Philosophy of Science’ (https://shorturl.at/4pUXz) I decided not to walk through the detail of the many methodologies in order of use. Instead, I chose to employ a long standing, and ongoing, scientific development as an exemplar. And so, I chose the ever evolving story of Thermodynamics as a scientific investigation at its best.
Conducted over a period of >200 years, Thermodynamics R&D, and application, benefitted from the highest levels of professionalism, collaboration, and technical thoroughness. New layers of application, methodology, and practice were made possible by the progressive advance of technology. In turn, this has seen measurement and modelling accuracy continually improved at a micro and macro level.
Perhaps most importantly, Thermodynamics rapidly became a primary tool in the advance of applied science/engineering/technology, spanning micro-tech, to aerospace and cosmology. I can think of no better a story to illustrate the breadth of scientific methodologies and applications at their best.
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
The debris of the ‘last major merger’ is dynamically youngSérgio Sacani
The Milky Way’s (MW) inner stellar halo contains an [Fe/H]-rich component with highly eccentric orbits, often referred to as the
‘last major merger.’ Hypotheses for the origin of this component include Gaia-Sausage/Enceladus (GSE), where the progenitor
collided with the MW proto-disc 8–11 Gyr ago, and the Virgo Radial Merger (VRM), where the progenitor collided with the
MW disc within the last 3 Gyr. These two scenarios make different predictions about observable structure in local phase space,
because the morphology of debris depends on how long it has had to phase mix. The recently identified phase-space folds in Gaia
DR3 have positive caustic velocities, making them fundamentally different than the phase-mixed chevrons found in simulations
at late times. Roughly 20 per cent of the stars in the prograde local stellar halo are associated with the observed caustics. Based
on a simple phase-mixing model, the observed number of caustics are consistent with a merger that occurred 1–2 Gyr ago.
We also compare the observed phase-space distribution to FIRE-2 Latte simulations of GSE-like mergers, using a quantitative
measurement of phase mixing (2D causticality). The observed local phase-space distribution best matches the simulated data
1–2 Gyr after collision, and certainly not later than 3 Gyr. This is further evidence that the progenitor of the ‘last major merger’
did not collide with the MW proto-disc at early times, as is thought for the GSE, but instead collided with the MW disc within
the last few Gyr, consistent with the body of work surrounding the VRM.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
3. Part 1: Theory
14:00-14:10 Welcome
14:10-14:15 Introduction into the METASPACE project
14:15-14:45 Metabolite annotation in HR imaging MS
14:45-15:00 Overview of the annotation engine
Coffee Break 15:00-15:30
Part 2: Tutorial
15:30-16:30 Step-by-step analysis of datasets provided in advance, questions
- Data requirements: 10 min
- Upload UI: 5 min
- Webapp UI: 15 min
- Interpretation: 15 min
- Split into 2 groups: 5 min
- Export to imzML, ideally parallel sessions: 15 min
- SCiLS, FTICR
- EMBL, Orbitrap
Coffee Break 16:35-17:00
Part 3: Hands-on training
17:00-18:00 questions, data analysis
Agenda
5. What we hope you will learn today
● Ins and outs of metabolite annotation in HR imaging MS
● Bioinformatics we developed for this problem
○ Metabolite Signal Match (MSM) score
○ False Discovery Rate estimation
○ FDR-controlled annotation
● The online engine we implemented
○ How to prepare data for submission to our service
○ How to submit your data
○ How to view molecular annotations in our webapp
6. Project overview: slides on slideshare
Bioinformatics: slides on slideshare
Theodore Alexandrov (EMBL, UCSD, SCiLS)
8. Outline
● Inputs (data and metadata)
● Online Software
● Data Submission
● Annotation Browsing
● Use Cases
a. mouse brain, MALDI-FTICR (UoR1)
b. human colorectal tumor, DESI-Orbitrap (ICL)
18. Agenda (recap of slide 3)
22. Data Requirements
Data format
- imzML, centroided
- vendor centroiding preferred
- conversion instructions: http://metasp.eu/imzml
- background on imzML: http://imzml.org/wp/introduction/
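"Centroided" means each profile-mode peak has been reduced to a single (m/z, intensity) pair. Vendor centroiding is strongly preferred; purely as an illustration of the idea (not the engine's or any vendor's algorithm), a minimal local-maximum picker could look like:

```python
def centroid(mzs, intensities, min_intensity=0.0):
    """Reduce a profile-mode spectrum to centroids at local maxima.

    A peak is kept where the intensity is a local maximum above the
    noise threshold. Toy sketch only; vendor centroiding is preferred.
    """
    peaks = []
    for i in range(1, len(mzs) - 1):
        y = intensities[i]
        if y > intensities[i - 1] and y >= intensities[i + 1] and y > min_intensity:
            peaks.append((mzs[i], y))
    return peaks

profile_mz = [100.00, 100.01, 100.02, 100.03, 100.04]
profile_int = [0.0, 5.0, 20.0, 4.0, 0.0]
print(centroid(profile_mz, profile_int))  # → [(100.02, 20.0)]
```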
23. Customised Processing
Processing is tailored to your data!
- Technical metadata
- Resolving power → isotope prediction
- Polarity → adducts
Example: predicted isotope pattern of [C41H78NO7P+K]+ at R200 = 70K vs R200 = 280K
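Why the reported resolving power matters: peak width scales inversely with R (from the definition R = m/Δm), so isotope peaks that are resolved at 280K may merge at 70K. A toy calculation, assuming for simplicity that the quoted R applies at the peak's own m/z (real instruments' resolving power varies with m/z):

```python
def peak_fwhm(mz, resolving_power):
    """Peak full width at half maximum, from the definition R = m / FWHM."""
    return mz / resolving_power

# a lipid-range ion near m/z 750 (illustrative value)
mz = 750.0
for r in (70_000, 280_000):
    print(f"R = {r}: FWHM ≈ {peak_fwhm(mz, r) * 1000:.1f} mDa")
```

Reporting the wrong resolving power therefore makes the predicted (theoretical) isotope pattern mismatch the measured one, lowering annotation scores.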
24. Data Requirements
Your responsibility:
- Data is processed ‘as is’
- Check metadata is correct
- Report resolving power accurately (check within data-set)
- Low numbers of annotations often correspond to poor quality mass spectra
- Calibration inaccuracy
- Lock-mass errors
26. Data upload
1. Follow the conversion instructions for your instrument
2. Select the centroided files, imzML and ibd
3. Click the Upload button.
The dataset will be copied to the cloud storage (accessible only to our team)
27. Metadata form
● Appears once the upload is started
● Please fill it in truthfully
○ Most fields have an ‘Other…’ option
○ If you don’t want to disclose a field, put ‘-‘
● Click (at the very bottom)
30. Annotation table
Sign in with a Google ID to provide feedback
Callouts: currently selected molecule (click to select); MSM score; principal peak m/z
31. Sorting/filtering annotations
- Click on column headers to sort
- Start typing a formula or a metabolite name
- Filter by database or dataset
- Select an adduct
- Set a minimum MSM score
- Enter an m/z of interest
33. Details for highlighted annotation
Molecule distribution (sum of isotope images)
Putative metabolite IDs from the database
Feedback!
Thumbs up: reasonable
Thumbs down: dubious
- tell us why it could be wrong!
Feedback is not public
34. Visual insight into MSM score assignment
Callouts: adduct; exact m/z of each ion image; zoom plot; ion images for each isotope peak
Isotopic patterns:
- Blue: theoretical abundance (at instrument resolving power)
- Red: measured image intensity
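The MSM score summarised here is, as published, the product of three components: a measure of spatial chaos of the principal ion image, the spatial correlation between isotope ion images, and the spectral correlation between theoretical (blue) and measured (red) isotope intensities. A sketch of the spectral component and the final product; the variable names and example numbers are illustrative, not the engine's code:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def msm_score(rho_chaos, rho_spatial, rho_spectral):
    """MSM = product of the three component measures, each in [0, 1]."""
    return rho_chaos * rho_spatial * rho_spectral

theoretical = [100.0, 45.0, 12.0, 2.5]  # predicted isotope abundances (blue)
measured    = [ 98.0, 43.0, 14.0, 3.0]  # measured image intensities (red)
rho_spectral = max(pearson(theoretical, measured), 0.0)
print(msm_score(0.95, 0.90, rho_spectral))
```

A poor match in any one component (chaotic image, uncorrelated isotope images, or a wrong isotope pattern) drives the product, and hence the annotation's rank, toward zero.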
41. Results Browsing Summary
1. Choose database
2. Choose data-set
3. Type ‘PC’
a. molecular class filter
4. Type ‘PC(16:0/18:0)’
a. single metabolite filter
5. Select row of table
a. single ion filter
6. Simple comparison of spatial distributions
between adducts
Also possible
● Filter by m/z
● Formula search
● Comparison across datasets
43. FDR-Controlled Annotation
False Discovery Rate: the fraction of incorrect annotations,
FDR = nFalse / (nFalse + nTrue)
Control: request a set of annotations at a fixed estimated FDR
Setting the level:
- Adjust the number of molecules for follow-up analysis
- When only a limited number of molecules can be reviewed, adjust the
FDR so that fewer or greater numbers of molecules are annotated
- Compare annotations between datasets
- A principled way of selecting molecules to compare between
datasets
Figure: MSM score distributions of true annotations vs false discoveries; raising the MSM threshold lowers the estimated FDR (e.g. from 0.2 to 0.1)
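In practice the engine estimates nFalse using decoy (implausible) adducts; purely to build intuition, the definition above can be applied to hypothetical score lists to see how a threshold is chosen for a target FDR:

```python
def fdr(scores_true, scores_false, threshold):
    """Estimated FDR among annotations with MSM score >= threshold:
    FDR = nFalse / (nFalse + nTrue)."""
    n_true = sum(s >= threshold for s in scores_true)
    n_false = sum(s >= threshold for s in scores_false)
    total = n_true + n_false
    return n_false / total if total else 0.0

def threshold_for_fdr(scores_true, scores_false, target_fdr):
    """Smallest MSM threshold whose estimated FDR meets the target."""
    for t in sorted(scores_true + scores_false):
        if fdr(scores_true, scores_false, t) <= target_fdr:
            return t
    return None

# Hypothetical MSM scores: raising the threshold trades annotations for confidence
true_scores = [0.9, 0.8, 0.7, 0.5]
decoy_scores = [0.6, 0.3, 0.2]
print(threshold_for_fdr(true_scores, decoy_scores, 0.10))  # → 0.7
```

Lowering the target FDR raises the threshold and shrinks the annotation list; raising it admits more annotations at the cost of more false discoveries.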
44. Choice of metabolite database
- Synthesized/recorded: 88M (CAS Registry)
- Biologically occurring/active: 50M (PubChem compounds)
- Single biological system: 40K (HMDB)
- Sample specific: 1K (e.g. from LC-MS)
45. Choice of metabolite database
Impacts search and False-Discovery-Rate estimation
● Use one that’s relevant
● Larger database
○ more false hits → fewer annotations at a fixed FDR
● Different databases give different annotations
○ even for molecules in both databases due to FDR control
○ for data-set comparison, use the same database
46. Annotating at level of molecular formula
● Possibility of multiple metabolites per sum formula
○ webapp shows all hits from the database search (learn the ambiguity!)
○ other databases can be searched (e.g. PubChem)
○ use enrichment analysis to get biological leads
● Use an orthogonal technique for reporting individual metabolites
○ not directly integrated (yet)
○ use the web-app results to help target MS/MS studies (e.g. purchase of standards)
47. ● we annotate molecular formulae along with several putative metabolites
■ MSI Levels of classification:
1. identified metabolites
2. putatively annotated compounds
3. putatively characterised compound classes
4. unknown compounds
● In preparation: formal guidelines for reporting imaging mass spectrometry annotations
Guidelines for reporting
The role of reporting standards for metabolite annotation and identification in metabolomic studies,
Salek et al., 2013, GigaScience
48. ● Preparing data for submission
○ imzML export
○ metadata
● Submitting data
○ upload web-app
upload.metasp.eu
● Browsing results
○ results web-app
alpha.metasp.eu
Learning Summary
● METASPACE team:
○ web: metaspace2020.eu
○ email: contact@metaspace2020.eu
○ twitter: @metaspace2020
○ github: github.com/spatialmetabolomics
● FTICR data conversion
○ SCiLS: support@scils.de
● Orbitrap data conversion
○ Thermo Fisher Scientific:
kerstin.strupat@thermofisher.com
How to get help?
50. (Group 1) Export into imzML: FT-ICR data
Using SCiLS Lab’s METASPACE export
51. Export to METASPACE
● Export your centroided high-resolution spectra in the imzML format
● Only available for “FT-ICR type” SCiLS Lab files in SCiLS Lab 2016b
● Best results in METASPACE when a peak list is used for centroiding
● Two different Bruker data formats
○ SQLite peak list data: Peak list provided during import
○ FT-ICR profile data: Generate a peak list after import
52. Create imzML file for METASPACE
● In the objects tab, click the export symbol of
the region to be exported and select
“Export to METASPACE”
● The Export Spectra dialog opens
● Set your normalization of choice
● Select your peak list of choice, for example “Imported Peaks” in the case of SQLite data
● Provide your scan polarity
● Click OK to save imzML file
53. SQLite peak list data
● Data must have been acquired with on-the-fly centroid detection,
i.e. there is a file called ‘peaks.sqlite’ within the .d folder of the data-set
● In SCiLS Lab a peak list “Imported peaks” is available, containing the most frequent peaks
(by default, all peaks appearing in more than 1% of spectra)
54. FT-ICR profile data
● Older Solarix files do not directly contain a peak list to perform centroiding
● Create a peak list with DataAnalysis (see SCiLS Lab Help, Section 7.4)
● Use METASPACE tool for peak finding
https://spatialmetabolomics.github.io/centroidize/
● Use other external tools (mMass, …)
● Import the external peak list into SCiLS Lab
File > Import > m/z intervals from CSV or Clipboard
55. Use METASPACE tool for peak finding
● Select the overview spectrum CSV exported from SCiLS
● Upload CSV file to METASPACE tool
● Copy values to clipboard
● Use File > Import > m/z intervals from CSV
57. SCiLS Cloud: Exchange within the Scientific Community
1. SCiLS Lab: computational analysis
2. SCiLS Cloud: data & results can be
shared and viewed in web browser, e.g.,
○ MALDI imaging data,
○ Discriminative m/z markers,
○ Regions of interest, …
Figures: comparison of mean spectra for ROIs; m/z images of co-localized ions
58. Future Vision: SCiLS Cloud and METASPACE
Diagram: from SCiLS Lab (statistical analysis), export data to imzML and upload to METASPACE; upload data and findings to SCiLS Cloud; prospect: a direct interface between the two
59. (Group 2) Export into imzML: Orbitrap data (.raw)
Instructions: metaspace2020.eu/imzML
Software tools:
imageQuest / raw-converter
- Recommended for: MALDI images (Thermo MALDI- / TransMIT AP-S-MALDI-)
imzmlConverter
- Recommended for: DESI/flowProbe with separate files per row
Recommended for bioinformaticians: pyimzML (Python parser)
63. This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement № 634402.
Acknowledgments
Example data was provided by:
University of Rennes 1
Regis Lavigne
Charles Pineau
EMBL
Ksenija Radic
Alexandra Koumoutsi
Andrew Palmer
EMBL
Theodore Alexandrov
Vitaly Kovalev
Artem Tarasov
Andrew Palmer
Dominik Fay
SCiLS
Dennis Trede
Jan Hendrik Kobarg