This document provides an overview of protein identification using mass spectrometry. It discusses how tandem mass spectrometry is used to break proteins into peptides and peptides into fragment ions. The fragment ion masses are then used to reconstruct peptides through de novo sequencing or database searching against a protein database. The document compares de novo sequencing, which reconstructs peptides from fragment ion masses, to database searching, which matches experimental spectra to theoretical spectra from a database.
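As a rough illustration of how fragment ion masses relate to a peptide sequence, the sketch below computes singly charged b- and y-ion m/z values and counts how many theoretical peaks are matched by an observed peak list, which is the core idea behind database searching. The function names, the 0.02 Da tolerance, and the simple match count are assumptions made for this example; real search engines use more elaborate scoring.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O, added to every y ion
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[k] is the prefix of length k + 1; y_ions[k] is the
    complementary suffix produced by the same backbone cleavage.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


def count_matches(observed, theoretical, tol=0.02):
    """Count theoretical peaks that lie within tol (Da) of an observed peak.

    The tolerance and this plain count are illustrative assumptions.
    """
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)
```

De novo sequencing works in the opposite direction: differences between consecutive peaks in one ion series are compared with the residue masses above to read off the sequence.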
Text mining to produce large chemistry datasets for community access (Valery Tkachenko)
While in an ideal world all data would be deposited by the producing scientist directly into a database, in the real world most chemical data is instead presented in a form designed for human rather than machine consumption. Text mining has the potential to extract this data back into a computer-understandable form. As all United States patents are available free of charge, they make an excellent corpus for extracting a large number of experimental properties of compounds and chemical reactions.
We report on our text-mining activities to extract millions of textual NMR spectra, hundreds of thousands of physicochemical properties (with their associated compounds) and over a million chemical reactions. All extracted results are to be deposited into online databases allowing the community to benefit from the results of this work.
Using Mestrelab Research’s MNova product we have converted the textual NMR spectra to graphical spectra, and validated each spectrum against its associated chemical structure so as to detect cases where the NMR spectrum could not be produced by the associated structure.
In the case of melting points, the resultant dataset of over a quarter of a million compound/melting temperature relationships is the largest public dataset the authors are aware of. We have used this dataset to produce a predictive model with results comparable to those of manually curated datasets. Our experience with modelling this data has demonstrated that we are working at the edge of current algorithmic and computing capabilities for predictive model building, with the resultant matrix containing over 200 billion descriptors. The melting point model and the data it was derived from are freely available from http://www.ochem.eu.
Quantitative Analysis of Transporter Protein using TripleTOF® 6600 System (SCIEX)
Transport plays an important role in the absorption, distribution, and elimination of a variety of drugs.
In recent years, a large number of transporters, both efflux (ATP-binding cassette (ABC) family) and influx (solute carrier (SLC) family), have been identified and well characterized in vitro.
However, the abundance of these transporters in hepatocytes and cell lines, as well as in tissues such as intestine, liver, and kidney, has not been accurately quantitated due to technical challenges.
This work aims to build a robust liquid chromatography-mass spectrometry (LC-MS) workflow on the SCIEX TripleTOF® 6600 platform to enable the quantitation of a variety of SLC and ABC drug transporters expressed in hepatocyte and cell line plasma membranes.
Using text mining to inform genetic variant interpretation (Karin Verspoor)
There are ongoing large-scale efforts to catalog genomic variation related to disease in structured databases. Much of the relevant information is available only from unstructured sources, including the scientific literature. In our work, we have explored the ability of text mining tools to recover the mutations catalogued in curated databases based on the article text, specifically examining the recovery of mutations in the COSMIC and InSiGHT databases. We demonstrate that there are excellent tools for extraction of mutation mentions from the literature, but that the recovery of the information in databases is far lower than would be expected from that tool performance, even when full-text articles are available. I will present an analysis in which we explore the impact of processing tables and supplementary material associated with the relevant literature, demonstrating that the coverage of variants improves dramatically, from 2% to over 50%. I will further present the Variome corpus, a small collection of full-text publications annotated with relationships such as gene-disease and mutation-disease relationships, and introduce our recent efforts to develop strategies to extract this relational information from the literature. Joint work with Antonio Jimeno Yepes (IBM Research) and Min Song (Yonsei University).
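To make the notion of "coverage" in the abstract above concrete, here is a minimal sketch that extracts simple mutation mentions from text with regular expressions and measures what fraction of a curated variant list is recovered. The two patterns and the function names are hypothetical simplifications chosen for this example; they are not the tools evaluated in the talk, and they will produce false positives (for instance, cell line names such as T47D).

```python
import re

# Hypothetical, deliberately simple patterns. Real mutation taggers
# handle many more notations (three-letter codes, deletions, ranges).
AMINO = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_SUB = re.compile(r"\b(?:p\.)?([%s])(\d+)([%s])\b" % (AMINO, AMINO))
CDNA_SUB = re.compile(r"\bc\.(\d+)([ACGT])>([ACGT])")


def extract_mentions(text):
    """Return a set of normalised mutation mentions found in text."""
    mentions = set()
    for m in PROTEIN_SUB.finditer(text):
        mentions.add("p." + "".join(m.groups()))
    for m in CDNA_SUB.finditer(text):
        mentions.add("c.{}{}>{}".format(*m.groups()))
    return mentions


def coverage(curated, extracted):
    """Fraction of curated variants that appear among the extracted ones."""
    if not curated:
        return 0.0
    return len(curated & extracted) / len(curated)
```

Running the same extraction over table cells and supplementary files, in addition to the main text, simply enlarges the extracted set before coverage is computed.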
The elucidation of natural product structures from analytical data, specifically NMR and MS, remains a major challenge. With an enormous palette of NMR experiments to choose from, supported by breakthrough technologies in hardware, generating data of high enough quality to determine even the most complex natural product structures is no longer the major hurdle. The challenge is in the analysis of the data. We are in a new era in terms of approaches to structure elucidation: one where computers, databases, and a synergy between scientists and algorithms can offer an accelerated path forward. Software tools are capable of digesting spectroscopic data to elucidate extremely complex natural products. Scientists can now elucidate chemical structures using multinuclear chemical shift data and correlation data from an array of 2D NMR experiments, and can use existing data sets for dereplication and computer-assisted structure elucidation. With the explosion of online data, especially in public databases such as PubChem and ChemSpider, many tens of millions of chemical structures are available to seed fragment databases for inclusion in the elucidation process. This presentation will provide an overview of how cheminformatics and chemical databases have been brought together to assist in the identification of natural products. It will include an examination of the state-of-the-art developments in Computer-Assisted Structure Elucidation.
The document discusses several topics related to protein structure prediction using Python:
1. It introduces the Chou-Fasman algorithm for predicting protein secondary structure from amino acid sequence. The algorithm calculates preference parameters for each amino acid to be in alpha helices, beta sheets, or other structures.
2. It provides an example of calculating helical propensity.
3. It lists the preference parameters output by the Chou-Fasman algorithm for each amino acid.
4. It outlines the steps of applying the Chou-Fasman algorithm to predict secondary structure elements in a protein sequence.
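The propensity calculation mentioned in the list above can be sketched as follows. A preference parameter is taken here as the fraction of a residue type found in helices divided by the overall fraction of residues in helices, so values above 1 indicate a helix former. The counts used in the usage note are hypothetical toy numbers, not the published parameters, and the plain window average is a simplification of the full nucleation and extension rules.

```python
def propensities(counts_in_helix, counts_total):
    """Preference parameter per residue: (helix fraction of residue) /
    (helix fraction of all residues). Counts are supplied by the caller."""
    n_helix = sum(counts_in_helix.values())
    n_total = sum(counts_total.values())
    overall = n_helix / n_total
    return {
        aa: (counts_in_helix.get(aa, 0) / counts_total[aa]) / overall
        for aa in counts_total
    }


def window_averages(sequence, table, window=6):
    """Average propensity over each sliding window of the sequence.

    Windows averaging above 1.0 are helix candidates in this simplified
    sketch; the window length of 6 is an illustrative default.
    """
    scores = [table[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]
```

For example, with hypothetical totals of 100 each for A, G, and E, of which 60, 20, and 70 are in helices, the parameters come out as 1.2, 0.4, and 1.4 respectively.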
This document provides an overview of sequence alignment and scoring matrices. It defines key terms like identity, homology, orthologous, and paralogous genes. It discusses different types of scoring matrices including unitary matrices that score matches as 1 and mismatches as 0, and transition/transversion matrices that account for the higher likelihood of transitional mutations in nucleic acids. The document emphasizes that scoring matrices represent underlying evolutionary models and influence sequence analysis outcomes.
This document provides an overview of sequence alignment and scoring matrices. It defines key terms like identity, homology, orthologous, and paralogous genes. It discusses different types of scoring matrices, including unitary matrices that score matches as 1 and mismatches as 0, and transition/transversion matrices that account for the different likelihood of transition vs. transversion mutations in DNA. It explains that scoring matrices represent implicit models of evolution and influence sequence analysis outcomes. The document emphasizes that results depend critically on the chosen scoring matrix and model.
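The effect of the scoring matrix on an alignment score can be illustrated with a short sketch that scores the same pair of aligned DNA sequences under a unitary matrix and under a transition/transversion matrix. The match, transition, and transversion values are hypothetical, chosen only so that transitions are penalised less than transversions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def unitary_score(a, b):
    """Unitary (identity) matrix: 1 for a match, 0 for any mismatch."""
    return 1 if a == b else 0


def ts_tv_score(a, b, match=1, transition=-1, transversion=-2):
    """Transition/transversion matrix with hypothetical score values.

    A transition stays within purines (A<->G) or pyrimidines (C<->T);
    a transversion crosses between the two classes.
    """
    if a == b:
        return match
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


def score_alignment(s1, s2, score):
    """Sum per-column scores of two equal-length, gap-free sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(score(a, b) for a, b in zip(s1, s2))
```

Scoring ACGT against GCGA gives 2 under the unitary matrix and -1 under the transition/transversion matrix, which shows how the chosen model changes the outcome for identical input.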
The document discusses various bioinformatics tools and algorithms for analyzing protein sequences, including Biopython for working with biological sequence data, the Kyte-Doolittle algorithm for predicting transmembrane regions, and the Chou-Fasman algorithm for predicting secondary structure from amino acid preferences for alpha helices, beta sheets, and random coils. It also provides examples of analyzing Swiss-Prot data to find properties of human proteins and applying these tools and libraries to extract insights from protein sequences.
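A Kyte-Doolittle hydropathy profile can be sketched in plain Python as a sliding-window average over the published hydropathy index. The window of 19 residues and the 1.6 threshold are commonly cited settings for spotting candidate transmembrane segments, but they are treated here as adjustable assumptions, and the function names are made up for this example.

```python
# Kyte-Doolittle hydropathy index per amino acid.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of each window; entry i covers residues i..i+window-1."""
    values = [KD[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """Start indices of windows whose mean hydropathy exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, score in enumerate(profile) if score > threshold]
```

Biopython users can obtain a similar profile through its protein analysis utilities; the pure-Python version above avoids any dependency.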
This document discusses different levels of protein structure from primary to quaternary structure. It explains that primary structure refers to the amino acid sequence of a protein. Secondary structure describes local folding patterns like alpha helices and beta sheets. Tertiary structure is the overall 3D shape of a single protein chain that results from folding. Quaternary structure involves the shape and interactions of multiple protein subunits. The document provides examples and diagrams to illustrate each level of structure and how they relate to determining a protein's function.
Online databases are increasingly being used for structure identification. In many cases a compound unknown to an investigator is already known in the chemical literature or in an online database, and these “known unknowns” are commonly available in aggregated internet resources. Such compounds in commercial, environmental, forensic, and natural product samples can be identified by searching these large aggregated databases by either elemental composition or monoisotopic mass. We will report on the search approaches that we offer on aggregated compound databases hosted by the Royal Society of Chemistry and how these resources can be used for structure identification.
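Searching by monoisotopic mass can be illustrated with a small sketch that computes the monoisotopic mass of an elemental composition and filters a candidate list by a parts-per-million tolerance. The tiny in-memory candidate list, the function names, and the 5 ppm default are assumptions for this example; this is not the search interface of any particular database.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0, "H": 1.00782503, "N": 14.00307401,
    "O": 15.99491462, "P": 30.97376163, "S": 31.97207100,
}


def parse_formula(formula):
    """Parse a simple formula such as 'C8H10N4O2' into element counts.

    Brackets, charges, and isotope labels are not supported.
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum of monoisotopic element masses for the given formula."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


def search_by_mass(query_mass, candidates, ppm=5.0):
    """Return names of (name, formula) candidates within ppm of query_mass."""
    hits = []
    for name, formula in candidates:
        mass = monoisotopic_mass(formula)
        if abs(mass - query_mass) / mass * 1e6 <= ppm:
            hits.append(name)
    return hits
```

A query of 194.0804 Da against a list containing caffeine (C8H10N4O2) and glucose (C6H12O6) returns only caffeine, since glucose lies about 14 Da away.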
This document discusses various topics relating to protein structure and bioinformatics. It begins with an overview of protein structure and why understanding protein structure is important. It then discusses the different levels of protein structure from primary to quaternary structure. Methods for determining protein structure like X-ray crystallography and NMR are mentioned. Databases for storing protein structures like the Protein Data Bank are also summarized. The document touches on topics like protein folding, domains, membrane protein topology, and secondary structure prediction methods.
The Next, Next Generation of Sequencing - From Semiconductor to Single Molecule (Justin Johnson)
The document discusses next generation sequencing technologies and challenges. It describes EdgeBio's sequencing platforms including Illumina, Ion Torrent, SOLiD, and PacBio machines. It highlights challenges such as experimental design considerations, flexibility with standards, sample preparation difficulties, and differences between platforms regarding read length, error rates, and yield. Overall the document provides an overview of sequencing technologies and issues researchers may face.
Abstract: Ontologies are used in numerous research disciplines and commercial applications to uniformly and semantically annotate real-world objects. Due to a rapid development of application domains the corresponding ontologies are changed frequently to include up-to-date knowledge. These changes dramatically influence dependent data as well as applications/systems, for instance, ontology mappings, that semantically interrelate ontologies. The talk will give an overview on evolution of ontologies and ontology-based mappings.
This document summarizes a presentation about the Cardiac Organellar Peptide Spectral Library (COPa) given by Rafael Jimenez on June 9, 2011. The COPa is a database of experimental peptide spectra from heart tissue of various species organized by organelle. It aims to catalogue spectra to serve as a reference for peptide identification searches. The presentation describes the consortium and teams involved in developing COPa, how it can be browsed and searched on the web and through desktop tools, and plans to expand COPa by integrating additional heart proteomics data from third parties.
This document provides an overview of databases, definitions, scoring matrices, and pairwise sequence alignment. It discusses major bioinformatics databases like NCBI, ExPASy, and EBI. It also defines key terms like identity, homology, orthologous, and paralogous sequences. Additionally, it examines the theoretical and empirical bases for scoring matrices like PAM, BLOSUM, and transition/transversion matrices, and how they are used in sequence alignment.
Whole-transcriptome profiling of bacterial samples can provide significant insights into mechanisms of prokaryotic metabolism. Unlike DNA profiling, it can also potentially discriminate between live and dead organisms in a mixed population, since RNA molecules have a shorter half-life than DNA molecules. Here, we describe how the Ion Torrent™ platform can be used to profile the transcriptome from E. coli and S. aureus cultures.
The Ion Total RNA-Seq kit and AB Library Builder™ were used for semi-automated library synthesis from ribo-depleted RNA. Alignment and analysis were performed using the Partek Flow Pipeline for Ion Whole Transcriptome Analysis. We obtained between 18 and 29 million reads per sample, allowing an average coverage depth of around 500x for E. coli and around 1000x for S. aureus. Correlation of expression levels between replicates was excellent, with a Pearson correlation coefficient averaging greater than 0.97. The top quartile of expressors had Pearson correlation coefficients greater than 0.99 for both E. coli and S. aureus. These data demonstrate that the Ion Torrent Proton system provides an ideal workflow and capacity for sequencing and analysis of prokaryotic transcriptomes.
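The two summary statistics quoted above, average coverage depth and the Pearson correlation between replicates, can be sketched as follows. The formula for depth (reads times read length divided by target size) is the usual approximation; the function names are hypothetical, and no read length or genome size from the study is assumed.

```python
import math


def mean_coverage(n_reads, read_length, target_size):
    """Approximate average depth: total sequenced bases / target size."""
    return n_reads * read_length / target_size


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

In practice, x and y would be per-gene expression values from two replicate libraries, often log-transformed before the correlation is computed.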
Use of Spark for proteomic scoring, Seattle presentation (lordjoe)
This document discusses using Apache Spark to parallelize proteomic scoring, which involves matching tandem mass spectra against a large database of peptides. The author developed a version of the Comet scoring algorithm and implemented it on a Spark cluster. This outperformed single machines by over 10x, allowing searches that took 8 hours to be done in under 30 minutes. Key considerations for running large jobs in parallel on Spark are discussed, such as input formatting, accumulator functions for debugging, and smart partitioning of data. The performance improvements allow searching larger databases and considering more modifications.
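One plausible reading of "smart partitioning" is grouping peptides by precursor mass so that each spectrum is scored only against peptides of similar mass. The plain-Python sketch below illustrates that idea without Spark itself; the bin width, tolerance, and function names are assumptions for this example and do not describe the author's implementation.

```python
from collections import defaultdict


def mass_bin(mass, bin_width=1.0):
    """Integer bin index for a precursor mass."""
    return int(mass // bin_width)


def partition_by_mass(items, bin_width=1.0):
    """Group (name, mass) pairs by mass bin; each bin could be one task."""
    bins = defaultdict(list)
    for name, mass in items:
        bins[mass_bin(mass, bin_width)].append((name, mass))
    return bins


def candidate_pairs(spectra, peptides, tolerance=0.5, bin_width=1.0):
    """Return (spectrum, peptide) name pairs whose masses agree within tolerance.

    Only the spectrum's own bin and its two neighbours are inspected,
    which is sufficient as long as tolerance <= bin_width.
    """
    peptide_bins = partition_by_mass(peptides, bin_width)
    pairs = []
    for spec_name, spec_mass in spectra:
        centre = mass_bin(spec_mass, bin_width)
        for neighbour in (centre - 1, centre, centre + 1):
            for pep_name, pep_mass in peptide_bins.get(neighbour, []):
                if abs(pep_mass - spec_mass) <= tolerance:
                    pairs.append((spec_name, pep_name))
    return pairs
```

On a cluster, the bin index would serve as the key for distributing work, so the expensive scoring step runs only on the pairs this filter keeps.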
This document discusses building an online profile as a scientist in the era of big data and open science. It begins with an overview of the speaker's background working in academia, industry, and as an entrepreneur. The speaker then discusses various online tools and platforms that scientists can use to share their work and expertise, such as ORCID, LinkedIn, Google Scholar, SlideShare, and ResearchGate. He emphasizes the importance of making contributions openly available online in order to increase visibility and measure impact through alternative metrics. The speaker also provides examples of using these tools to showcase his own career and publications.
This presentation gives an introduction to analysing ChIP-seq data and is part of a bioinformatics workshop. The accompanying websites are available at http://sschmeier.github.io/bioinf-workshop/#!galaxy-chipseq/
The document discusses various applications of luminescent molecular systems for sensing, computing, and identification. It describes sensors that can detect ions and pH and molecular logic gates that can perform Boolean operations. It also discusses using combinations of these systems fixed to beads to allow identification and parallel processing of large numbers of objects.
This document discusses ChemSpider, an online database of chemical compounds. It summarizes ChemSpider's capabilities, including searching by mass or formula for structure identification. ChemSpider contains over 34 million chemicals from various sources that can be searched and filtered. The document outlines how ChemSpider provides value to mass spectrometrists and discusses efforts to integrate more spectral data like NMR spectra directly into ChemSpider from publications and individual submissions. Future goals include hosting over a million spectra online and improving visualization of spectral data.
This document discusses using artificial neural networks to classify protein loops based on amino acid sequence. It provides background on protein structure, outlines challenges in protein structure prediction, and describes how neural networks like Hidden Layer Vector Quantization can be used to classify different types of protein loops from sequence alone with reasonable accuracy. The document concludes by discussing future work, including improved amino acid coding schemes and exploring protein structure information beyond multiple sequence alignments.
The Future of Metabolic Phenotyping Using data bandwidth to maximize N, analy... (InsideScientific)
Methods matter. In metabolic measurement, confidence in reproducible results relies heavily on the design of the system used to acquire data. In the field of translational metabolic and behavioral phenotyping there is critical demand for more – throughput, standardization, synchronization of diverse data streams, temporal resolution, efficiency of workflow, and verification of results. We compare continuous and switched metabolic measurement methodologies and explore applications that benefit most from continuous measurement.
In this exclusive webinar sponsored by Sable Systems International, experts contrast methodologies and discuss how to improve best practices in metabolic phenotyping. We show how advances in high-bandwidth metabolic measurement, as implemented in Promethion metabolic phenotyping systems, leverage a 60- to 1200-fold increase in temporal resolution and achieve synchrony with intake and other behavioral data.
Key Topics:
* Time-saving methodologies for increasing throughput in multiplexed or continuous metabolic phenotyping
* Evaluation criteria for selecting a metabolic measurement system
* How the home-cage advantage of a pull-mode system reliably increases animal safety while dramatically reducing stress on both the animal and the researcher
* How to improve the resolution, accuracy and versatility of metabolic data using water vapor measurement
* The importance of raw data retention in metabolic phenotyping
* How deep data field format leads to greater traceability, improved reliability and far greater data extraction versatility to address research objectives
* How exact metabolic costs can be assigned to transient activities, with important implications for studies of energy balance, obesity, drug kinetics and metabolic diseases
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant... (Kate Barlow)
This document discusses methods for quantifying and analyzing microRNAs (miRNAs) using quantitative PCR (qPCR). It presents a new two-tailed RT-qPCR method that provides high sensitivity and specificity for detecting miRNAs, including discrimination of miRNA isoforms. The method allows unlimited multiplexing in the reverse transcription step followed by singleplex qPCR. The document benchmarks the two-tailed RT-qPCR method on biological samples, showing it can sensitively detect less than 10 molecules and maintain specificity across the entire miRNA sequence. It also demonstrates two-tube multiplexing of the method to profile expression levels of several miRNAs in different tissues.
This document discusses opportunities for developing novel compounds targeting medically relevant protein families using non-mainstream scaffolds. It notes 400 novel scaffolds in a European compound library that exhibit structural complexity, stereochemistry, scaffold diversity, and challenging chemistry. The document discusses opportunities for targeting G protein-coupled receptors, kinases, and protein-protein interactions using spirocyclic, anellated, and DNA-encoded scaffold libraries. It emphasizes the potential of these approaches for developing compounds with improved target residence time and selectivity profiles through disrupting a protein's hydrophobic spine formation.
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o... (NextMove Software)
The Cahn-Ingold-Prelog (CIP) priority rules have been the cornerstone of written communication of stereochemical configuration for more than half a century. The rules rank ligands around a stereocentre, allowing a stereo-descriptor that is invariant to atom order and layout to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that, beyond trivial cases, different software may assign different labels to the same structure diagram.
There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software that provide CIP labels and where they disagree. Providing an IUPAC verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy in exchange of written chemical information for all.
This document discusses different levels of protein structure from primary to quaternary structure. It explains that primary structure refers to the amino acid sequence of a protein. Secondary structure describes local folding patterns like alpha helices and beta sheets. Tertiary structure is the overall 3D shape of a single protein chain that results from folding. Quaternary structure involves the shape and interactions of multiple protein subunits. The document provides examples and diagrams to illustrate each level of structure and how they relate to determining a protein's function.
Increasingly online databases are being used for the purpose of structure identification. In many cases an unknown to an investigator is known in the chemical literature or online database and these “known unknowns” are commonly available in these aggregated internet resources. The identification of these types of compounds in commercial, environmental, forensic, and natural product samples can be identified by searching against these large aggregated databases querying by either elemental composition or monoisotopic mass. We will report on the search approaches that we offer on aggregated compound databases hosted by the Royal Society of Chemistry and how these resources can be used for the purpose of structure identification.
This document discusses various topics relating to protein structure and bioinformatics. It begins with an overview of protein structure and why understanding protein structure is important. It then discusses the different levels of protein structure from primary to quaternary structure. Methods for determining protein structure like X-ray crystallography and NMR are mentioned. Databases for storing protein structures like the Protein Data Bank are also summarized. The document touches on topics like protein folding, domains, membrane protein topology, and secondary structure prediction methods.
The Next, Next Generation of Sequencing - From Semiconductor to Single MoleculeJustin Johnson
The document discusses next generation sequencing technologies and challenges. It describes EdgeBio's sequencing platforms including Illumina, Ion Torrent, SOLiD, and PacBio machines. It highlights challenges such as experimental design considerations, flexibility with standards, sample preparation difficulties, and differences between platforms regarding read length, error rates, and yield. Overall the document provides an overview of sequencing technologies and issues researchers may face.
Abstract: Ontologies are used in numerous research disciplines and commercial applications to uniformly and semantically annotate real-world objects. Due to a rapid development of application domains the corresponding ontologies are changed frequently to include up-to-date knowledge. These changes dramatically influence dependent data as well as applications/systems, for instance, ontology mappings, that semantically interrelate ontologies. The talk will give an overview on evolution of ontologies and ontology-based mappings.
This document summarizes a presentation about the Cardiac Organellar Peptide Spectral Library (COPa) given by Rafael Jimenez on June 9, 2011. The COPa is a database of experimental peptide spectra from heart tissue of various species organized by organelle. It aims to catalogue spectra to serve as a reference for peptide identification searches. The presentation describes the consortium and teams involved in developing COPa, how it can be browsed and searched on the web and through desktop tools, and plans to expand COPa by integrating additional heart proteomics data from third parties.
This document provides an overview of databases, definitions, scoring matrices, and pairwise sequence alignment. It discusses major bioinformatics databases like NCBI, ExPASy, and EBI. It also defines key terms like identity, homology, orthologous, and paralogous sequences. Additionally, it examines the theoretical and empirical bases for scoring matrices like PAM, BLOSUM, and transition/transversion matrices, and how they are used in sequence alignment.
Whole-transcriptome profiling of bacterial samples can provide significant insights into mechanisms of prokaryotic metabolism. Unlike DNA profiling, it can also potentially discriminate between live and dead organisms in a mixed population, since RNA molecules have a shorter half-life than DNA molecules. Here, we describe how the Ion Torrent™ platform can be used to profile the transcriptome from E. coli and S. aureus cultures.
The Ion Total RNA-Seq kit and AB Library Builder™ were used for semi-automated library synthesis from ribo-depleted RNA. Alignment and analysis were performed using the Partek Flow Pipeline for Ion Whole Transcriptome Analysis. We obtained between 18 and 29 million reads per sample, allowing an average coverage depth of around 500x for E. coli and around 1000x for S. aureus. Correlation of expression levels between replicates was excellent, with a Pearson correlation coefficient averaging greater than 0.97 between replicate samples. The top quartile of expressors had Pearson correlation coefficients greater than 0.99 for both E. coli and S. aureus. These data demonstrate that the Ion Torrent Proton system provides an ideal workflow and capacity for sequencing and analysis of prokaryotic transcriptomes.
Use of spark for proteomic scoring seattle presentationlordjoe
This document discusses using Apache Spark to parallelize proteomic scoring, which involves matching tandem mass spectra against a large database of peptides. The author developed a version of the Comet scoring algorithm and implemented it on a Spark cluster. This outperformed single machines by over 10x, allowing searches that took 8 hours to be done in under 30 minutes. Key considerations for running large jobs in parallel on Spark are discussed, such as input formatting, accumulator functions for debugging, and smart partitioning of data. The performance improvements allow searching larger databases and considering more modifications.
This document discusses building an online profile as a scientist in the era of big data and open science. It begins with an overview of the speaker's background working in academia, industry, and as an entrepreneur. The speaker then discusses various online tools and platforms that scientists can use to share their work and expertise, such as ORCID, LinkedIn, Google Scholar, SlideShare, and ResearchGate. He emphasizes the importance of making contributions openly available online in order to increase visibility and measure impact through alternative metrics. The speaker also provides examples of using these tools to showcase his own career and publications.
This presentation gives an introduction to analysing ChIP-seq data and is part of a bioinformatics workshop. The accompanying websites are available at http://sschmeier.github.io/bioinf-workshop/#!galaxy-chipseq/
The document discusses various applications of luminescent molecular systems for sensing, computing, and identification. It describes sensors that can detect ions and pH and molecular logic gates that can perform Boolean operations. It also discusses using combinations of these systems fixed to beads to allow identification and parallel processing of large numbers of objects.
This document discusses ChemSpider, an online database of chemical compounds. It summarizes ChemSpider's capabilities, including searching by mass or formula for structure identification. ChemSpider contains over 34 million chemicals from various sources that can be searched and filtered. The document outlines how ChemSpider provides value to mass spectrometrists and discusses efforts to integrate more spectral data like NMR spectra directly into ChemSpider from publications and individual submissions. Future goals include hosting over a million spectra online and improving visualization of spectral data.
This document discusses using artificial neural networks to classify protein loops based on amino acid sequence. It provides background on protein structure, outlines challenges in protein structure prediction, and describes how neural networks like Hidden Layer Vector Quantization can be used to classify different types of protein loops from sequence alone with reasonable accuracy. The document concludes by discussing future work, including improved amino acid coding schemes and exploring protein structure information beyond multiple sequence alignments.
The Future of Metabolic Phenotyping Using data bandwidth to maximize N, analy...InsideScientific
Methods matter. In metabolic measurement, confidence in reproducible results relies heavily on the design of the system used to acquire data. In the field of translational metabolic and behavioral phenotyping there is critical demand for more – throughput, standardization, synchronization of diverse data streams, temporal resolution, efficiency of workflow, and verification of results. We compare continuous and switched metabolic measurement methodologies and explore applications that benefit most from continuous measurement.
In this exclusive webinar sponsored by Sable Systems International, experts contrast methodologies and discuss how to improve best practices in metabolic phenotyping. We show how advances in high-bandwidth metabolic measurement, as implemented in Promethion metabolic phenotyping systems, leverage a 60- to 1200-fold increase in temporal resolution and achieve synchrony with intake and other behavioral data.
Key Topics:
* Time-saving methodologies for increasing throughput in multiplexed or continuous metabolic phenotyping
* Evaluation criteria for selecting a metabolic measurement system
* How the home-cage advantage of a pull-mode system reliably increases animal safety while dramatically reducing stress on both the animal and the researcher
* How to improve the resolution, accuracy and versatility of metabolic data using water vapor measurement
* The importance of raw data retention in metabolic phenotyping
* How deep data field format leads to greater traceability, improved reliability and far greater data extraction versatility to address research objectives
* How exact metabolic costs can be assigned to transient activities, with important implications for studies of energy balance, obesity, drug kinetics and metabolic diseases
Two-Tailed PCR - New Ultrasensitive and Ultraspecific Technique for the Quant...Kate Barlow
This document discusses methods for quantifying and analyzing microRNAs (miRNAs) using quantitative PCR (qPCR). It presents a new two-tailed RT-qPCR method that provides high sensitivity and specificity for detecting miRNAs, including discrimination of miRNA isoforms. The method allows unlimited multiplexing in the reverse transcription step followed by singleplex qPCR. The document benchmarks the two-tailed RT-qPCR method on biological samples, showing it can sensitively detect less than 10 molecules and maintain specificity across the entire miRNA sequence. It also demonstrates two-tube multiplexing of the method to profile expression levels of several miRNAs in different tissues.
This document discusses opportunities for developing novel compounds targeting medically relevant protein families using non-mainstream scaffolds. It notes 400 novel scaffolds in a European compound library that exhibit structural complexity, stereochemistry, scaffold diversity, and challenging chemistry. The document discusses opportunities for targeting G protein-coupled receptors, kinases, and protein-protein interactions using spirocyclic, anellated, and DNA-encoded scaffold libraries. It emphasizes the potential of these approaches for developing compounds with improved target residence time and selectivity profiles through disrupting a protein's hydrophobic spine formation.
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
The Cahn-Ingold-Prelog (CIP) priority rules have been the cornerstone of written communication of stereochemical configuration for more than half a century. The rules rank ligands around a stereocentre, allowing a stereo-descriptor that is invariant to atom order and layout to be assigned, for example R (right) or S (left) for tetrahedral atoms. Despite their widespread daily use, many chemists may be surprised to find that beyond trivial cases, different software may assign different labels to the same structure diagram.
There have been several attempts to either replace or amend the CIP rules. This talk will highlight the more challenging aspects of the ranking and present a comparison of software that provide CIP labels and where they disagree. Providing an IUPAC verified free and open source CIP implementation would allow software maintainers and vendors to validate and improve their implementations. Ultimately this would improve the accuracy in exchange of written chemical information for all.
The document describes string comparison techniques using matrix algebra and seaweed matrices. It introduces the concept of semi-local string comparison, which involves comparing a whole string to substrings of another string. The key idea is representing string comparison matrices implicitly using seaweed matrices, which represent unit-Monge matrices. This allows developing algebraic techniques for efficiently multiplying such matrices using the algebra of braids and the seaweed monoid. These multiplication techniques can then be applied to problems like dynamic programming string comparison and comparing compressed strings.
The document provides an overview of the KNIME analytics platform and its capabilities. It discusses:
- KNIME's origins, offices, codebase, and application areas including pharma, healthcare, finance, retail, and more.
- The key components of the KNIME platform including data access, transformation, analysis, visualization, and deployment capabilities.
- Integrations with tools like R, Weka, databases, and file formats.
- Community contributions expanding KNIME's functionality in areas like bioinformatics, chemistry, image processing, and more.
The nuclear age has passed, and it is becoming ever clearer that the science of the 21st century will focus on living systems, medicine, and the human being in all its aspects. This is where the largest financial investments are being made, and it is on this field that humanity pins its greatest hopes. Concrete discussions are heard more and more often of topics that until recently seemed like science fiction: will humanity be able to defeat aging, cancer, and other deadly diseases? Will we be able to change our genome at will? Will we be masters of our own bodies to the same degree that we run things on Earth?
For many decades biology and medicine developed as descriptive sciences. But as it matures and accumulates information, any science sooner or later switches to a more precise language, the language of mathematics. The Human Genome Project delivered a technological breakthrough that will feed the life sciences for many years to come, but it also posed many new global questions for today's scientists.
Cancer Immunotherapy: A Systems Biology Perspective. Maxim ...BioinformaticsInstitute
This document summarizes recent advances in cancer immunotherapy from the perspective of systems biology. It discusses how checkpoint blockade immunotherapy works by addressing the second co-inhibitory checkpoint signal needed for T cell activation. Computational methods are now able to identify tumor-specific neoantigens that can be targeted by immunotherapy. Mouse model studies showed that certain tumors are naturally rejected due to expression of a mutant antigen recognized by T cells, and that antigen-specific T cells are present before immunotherapy treatment. The high mutational load in melanoma makes it particularly responsive to checkpoint blockade. Early work in the 19th century by William Coley observed tumor regression following bacterial infection, which led to development of a toxin mixture that resembled modern vaccine formulations. Members of
http://bioinformaticsinstitute.ru/guests
On Friday, October 10, at 19:00, Maria Shutova (Institute of General Genetics, Russian Academy of Sciences) gave an open lecture on cancer research at the Bioinformatics Institute.
Cancer is one of the most common causes of death worldwide. The lecture examines how knowledge of evolution, genome function, and reprogramming, together with bioinformatics methods, has helped us better understand how a tumor develops and to propose new treatments for various types of cancer. It also covers mouse models of cancer development and the interesting results obtained with them.
http://bioinformaticsinstitute.ru/lectures
Guest lecture at the Bioinformatics Institute, October 9, 2014. Lecturer: Maria Shutova (Institute of General Genetics, Russian Academy of Sciences).
Over the past ten years, pluripotent cells have been the subject of two Nobel Prizes and many thousands of scientific and popular-science articles. Their unique ability to turn into any cell of the adult organism still gives food for thought both to developmental biologists and to scientists seeking treatments for genetic diseases. The lecture covers two types of pluripotent cells: "natural" (embryonic stem cells) and "artificial" (induced pluripotent stem cells). We will look separately at how knowledge of the way transcription factors work helped to reprogram cells, and how these "artificial" pluripotent cells can be used in medicine.
Sequencing as a Tool for Studying Complex Human Phenotypes: From Gen...BioinformaticsInstitute
This document summarizes genetic analyses of complex human phenotypes. It describes whole genome sequencing of individuals from bipolar disorder families and finding an association between genetic variation in a chromosome 6 region and amygdala volume. It also discusses rare variant sequencing of metabolic syndrome-related genes in Finnish cohorts, identifying new signals beyond existing GWAS hits. Additionally, it outlines exome and targeted sequencing of Tourette syndrome pedigrees, with a genome-wide significant result in a long non-coding RNA gene linked to the trait.
In his lecture, Andrey Afanasyev talked about startups in biotech and bioinformatics and about his own bioinformatics project iBinom, examined several biotechnology projects through the eyes of innovators and investors, touched on the search for investment, and shared his personal experience of working with venture funds and development institutions.
This document provides an overview of the ENCODE project and how its data can be accessed through the UCSC Genome Browser. It discusses the different types of ENCODE data available, including mapping data, gene annotations, expression data, regulatory information, and genetic variation. It also explains how to find, view, and download ENCODE tracks from the Genome Browser and where to get more information about ENCODE. The overall goal of the ENCODE project is to identify all functional elements in the human genome.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
In this talk we will discuss DDoS protection tools and best practices, network architectures, and what AWS has to offer. We will also look into one of the largest DDoS attacks on Ukrainian infrastructure, which happened in February 2022. We'll see what techniques helped keep web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on the experience in Ukraine.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you surely want to stay within budget and save costs wherever possible. We understand that, and we would like to help!
We explain how to solve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some approaches that can lead to unnecessary expense, for example using a person document instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep track of everything. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to use it best
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip, presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of the slides for the talk I gave on the main changes brought by CCS TSI 2023 at the biggest Czech conference on railway communications and signalling systems, held at the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
2. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Outline
• Tandem Mass Spectrometry
• De Novo Peptide Sequencing
• Spectrum Graph
• Protein Identification via Database Search
• Identifying Post-Translationally Modified Peptides
• Spectral Convolution
• Spectral Alignment
3. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Different Amino Acids Have Different Masses
H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH
Ri-1 Ri Ri+1
AA residuei-1 AA residuei AA residuei+1
N-terminus C-terminus
4. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Fragmentation
• Peptides tend to fragment along the backbone.
• A mass spectrometer is a sophisticated (and rather expensive!) scale for measuring the masses of these fragments.
H...-HN-CH-CO . . . NH-CH-CO-NH-CH-CO-…OH
Ri-1 Ri Ri+1
H+
Prefix Fragment Suffix Fragment
Collision Induced Dissociation
5. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Breaking Protein into Peptides and
Peptides into Fragment Ions
• Most mass spectrometers can only measure masses of short peptides (e.g., 20 amino acids or fewer) rather than masses of entire proteins (usually hundreds of amino acids). That's why:
• Proteases, e.g. trypsin, break the protein into short peptides.
• A tandem mass spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.
• The mass spectrometer accelerates the fragment ions; heavier ions accelerate more slowly than lighter ones.
• The mass spectrometer measures the mass/charge ratio of an ion.
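To make the mass/charge ratio concrete, here is a minimal sketch of how an observed m/z value relates to a neutral mass for a positively charged ion. It assumes the ion carries its charge as added protons (the usual case for peptide ions); the constant and the function name are illustrative and not taken from the slides.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons (assumed charge carrier)

def mass_to_charge(neutral_mass, charge):
    """Return m/z for an ion made of a neutral molecule plus `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge

# The same 1000 Da molecule appears at different positions on the m/z axis
# depending on how many charges it carries.
for z in (1, 2, 3):
    print(z, round(mass_to_charge(1000.0, z), 4))
```

Because the instrument reports m/z rather than mass, the charge state has to be known or inferred before fragment masses can be compared with residue masses.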
6. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
N- and C-terminal Peptides
7. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Terminal peptides and ion types
Peptide
Mass (D) 57 + 97 + 147 + 114 = 415
8. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Masses of fragment ions
Peptide
Mass (D) 57 + 97 + 147 + 114 = 415
Peptide
Mass (D) 57 + 97 + 147 + 114 – 18 = 397
without
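As a quick check of the arithmetic on this slide, the sketch below sums the four residue masses and subtracts 18 Da for the loss of one water molecule. The function names are illustrative and not part of the slides.

```python
WATER = 18  # nominal mass of H2O in daltons, as used on the slide

def peptide_mass(residue_masses):
    """Sum of the residue masses of a peptide (integer daltons)."""
    return sum(residue_masses)

def mass_without_water(residue_masses):
    """Mass of the same peptide after losing one water molecule."""
    return peptide_mass(residue_masses) - WATER

residues = [57, 97, 147, 114]
print(peptide_mass(residues))        # 415
print(mass_without_water(residues))  # 397
```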
9. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
N- and C-terminal Peptides
N-terminal peptide masses: 57, 154, 301, 415
C-terminal peptide masses: 71, 185, 332, 429
Full peptide mass: 486
11. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Theoretical Spectrum
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 154 185 301 332 415 429 486
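As a rough illustration, the theoretical spectrum above can be reproduced from integer residue masses. The peptide GPFNA is an assumption inferred from the masses 57 + 97 + 147 + 114 (+ 71) shown on these slides; the function name and the small mass table are illustrative only.

```python
# Integer residue masses (Da) for the residues assumed in this example.
RESIDUE_MASS = {"G": 57, "P": 97, "F": 147, "N": 114, "A": 71}

def theoretical_spectrum(peptide):
    """Return the sorted masses of all N-terminal (prefix) and
    C-terminal (suffix) peptides of the given peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    prefixes = [sum(masses[:i]) for i in range(1, len(masses) + 1)]
    suffixes = [sum(masses[i:]) for i in range(1, len(masses))]
    return sorted(set(prefixes + suffixes))

spectrum = theoretical_spectrum("GPFNA")
print(spectrum)  # [57, 71, 154, 185, 301, 332, 415, 429, 486]
```

De novo sequencing is the inverse task: recovering the peptide from such a list of masses.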
12. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 154 185 301 332 415 429 486
13. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486
14. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 81 100 112 131 160 172 177 185 201 221 235 301 312 325 370 387 409 415 423 429 460 472 486
15. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Fragmentation
y3
b2
y2 y1
b3a2 a3
HO NH3
+
| |
R1 O R2 O R3 O R4
| || | || | || |
H -- N --- C --- C --- N --- C --- C --- N --- C --- C --- N --- C -- COOH
| | | | | | |
H H H H H H H
b2-H2O
y3 -H2O
b3- NH3
y2 - NH3
16. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Mass Spectra
G V D L K
mass
0
57 Da = 'G' 99 Da = 'V'
LK D V G
• The peaks in the mass spectrum:
• Prefix and Suffix Fragments.
• Fragments with neutral losses (-H2O, -NH3)
• Noise and missing peaks.
17. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Protein Identification with MS/MS
[Figure: MS/MS spectra (intensity vs. mass) are passed to peptide identification against a protein database]
18. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tandem Mass Spectrum
• Tandem Mass Spectrometry mainly generates N-
and C-terminal fragment ions
• Chemical noise often complicates the spectrum.
• Represented in 2-D: mass/charge axis vs. intensity
axis
[Figure: example tandem mass spectrum (S#: 1708, RT: 54.47, Full ms2 638.00 [165.00 - 1925.00]). x-axis: m/z (200 to 2000); y-axis: relative abundance (0 to 100). Labeled peaks: 226.9, 326.0, 397.1, 425.0, 489.1, 524.9, 588.1, 589.2, 629.0, 687.3, 850.3, 851.4, 949.4, 1048.6, 1049.6]
19. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tandem Mass-Spectrometry
20. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Breaking Proteins into Peptides
[Figure: the protein MPSERGTDIMRPAKID...... is broken into peptides (MPSER, ……, GTDIMR, PAKID, ……), which pass through HPLC to MS/MS]
21. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Mass Spectrometry
Matrix-Assisted Laser Desorption/Ionization (MALDI)
From lectures by Vineet Bafna (UCSD)
23. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Protein Identification by Tandem
Mass Spectrometry (MS/MS)
[Figure: the MS/MS instrument produces a spectrum (m/z vs. relative abundance), which is converted into a sequence by one of two routes]
database search
• Sequest, Mascot, etc.
de novo interpretation
• Lutefisk, Peaks, etc.
24. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
De Novo vs. Database Search
[Figure: the same MS/MS spectrum (m/z vs. relative abundance) interpreted by de novo sequencing and by database search]
De Novo
AVGELTK
Database of all peptides = 20^n
AAAAAAAA, AAAAAAAC, AAAAAAAD, AAAAAAAE,
AAAAAAAG, AAAAAAAF, AAAAAAAH, AAAAAAI,
AVGELTI, AVGELTK, AVGELTL, AVGELTM,
YYYYYYYS, YYYYYYYT, YYYYYYYV, YYYYYYYY
Database Search
Database of known peptides
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT, AVGELTK,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
Mass, Score
25. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
De Novo vs. Database Search: A Paradox
• The database of all peptides is huge: ≈ 20^n peptides of length n
• The database of all known peptides is much smaller: ≈ 10^8 peptides
• However, de novo algorithms can be much faster, even though their search space is much larger!
• A database search scans all peptides in the database of all known peptides to find the best one.
• De novo eliminates the need to scan the database of all peptides by modeling the problem as a graph search.
27. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Three Algorithmic Problems
• Searching for a million words in a text. Suppose it takes
1 sec to find a word in a text. How much time would it take
to find 1 million words in the text? 1 million seconds?
• Searching for a word without even looking at 99.999%
of the text. Suppose you search for a word in a text.
Would it be possible to ignore 99.999% of the text, scan
only the remaining part and guarantee that the word you
are looking for will be found?
• Finding spelling errors in a book written in an
unknown language. Given a book (in an unknown
language) and a misspelled word (with insertions,
deletions, and substitutions of letters) correct spelling
errors in the word.
28. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Genomics: Problems Solved.
• Searching for a million words in a text.
The Aho-Corasick algorithm takes roughly the same time
with a million words as it takes with a single word.
• Searching for a word without even looking at
99.999% of the text.
Filtration algorithms (like FASTA or BLAST) ignore
99.999% of the text.
• Finding spelling errors.
Sequence alignment algorithms (like Smith-Waterman) do it in quadratic time.
29. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Proteomics: Three Problems
• Comparing a million spectra against a database. Suppose it
takes 1 sec to interpret a spectrum. How much time would it take to
interpret 1 million spectra?
• Mass-spectrometry database search without even looking at
99.999% of the database. Suppose you compare a spectrum
against a database. Would it be possible to ignore 99.999% of the
database, scan only the remaining part and guarantee that you still
can identify a peptide of interest?
• Blind PTM search and discovery of new PTM types. Given a
spectrum of a peptide with unknown PTM types, find this peptide in
the database. Discover new PTM types by data mining of large
MS/MS datasets.
30. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Three Solutions
• Comparing a million spectra against a database.
InsPecT (Tanner et al., Anal. Chem, 2005)
• MS/MS database search without even looking at
99.999% of the database.
InsPecT (Tanner et al., Anal. Chem, 2005)
• Blind PTM search and discovery of new PTM types. Given a
spectrum of a peptide with unknown PTM types, find this
peptide in the database. Discover new PTM types by data
mining of large MS/MS datasets.
MS-Alignment (Tsur et al., Nature Biotech., 2005)
31. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration: Combining De Novo Sequencing and
Database Search in Mass-Spectrometry
• So far de novo and database search were presented as
two separate techniques
• Database search is rather slow: many labs generate
more than 100,000 spectra per day. SEQUEST takes
approximately 1 minute to compare a single spectrum
against SWISS-PROT (54Mb) on a desktop.
• It will take SEQUEST more than 2 months to analyze the
MS/MS data produced in a single day.
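The "more than 2 months" figure follows directly from the numbers on this slide; a quick check of the arithmetic, using only the slide's own estimates (100,000 spectra per day, about 1 minute per spectrum):

```python
# Figures taken from the slide; both are rough estimates.
spectra_per_day = 100_000
minutes_per_spectrum = 1

total_minutes = spectra_per_day * minutes_per_spectrum
total_days = total_minutes / (60 * 24)  # minutes in a day

print(f"{total_days:.1f} days")  # about 69 days, i.e. more than 2 months
```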
• Can slow database search be combined with fast de
novo analysis?
33. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Building Spectrum Graph
• How to create vertices (from masses)
• How to create edges (from mass differences)
• How to score vertices
• How to score paths
• How to find the best path
34. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
S E Q U E N C E
b-ions (prefix or N-terminal ions)
Mass/Charge (M/Z)
35. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
a-ions = b-ions - CO = b-ions - 28
Mass/Charge (M/Z)
S E Q U E N C E
36. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
S E Q U E N C E
Mass/Charge (M/Z)
Shifting Peaks: a-ions = b-ions - CO = b-ions - 28
37. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
y-ions (suffix or C-terminal ions)
Mass/Charge (M/Z)
E C N E U Q E S
38. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Mass/Charge (M/Z)
Intensity
39. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Mass/Charge (M/Z)
Intensity
40. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
noise
Mass/Charge (M/Z)
41. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MS/MS Spectrum
Mass/Charge (M/z)
Intensity
42. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Some Mass Differences between Peaks
Correspond to Amino Acids
[Figure: MS/MS spectrum in which mass differences between pairs of peaks are labeled with the amino acids s, e, q, u, e, n, c, e]
44. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ion Types
• Some masses correspond to fragment ions,
others are just random noise
• Knowing ion types Δ={δ1, δ2,…, δk} allows one to
distinguish fragment ions from noise
• We can learn ion types δi and their probabilities qi
by analyzing a large test sample of annotated
spectra.
45. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Example of Ion Type
• Δ={δ1, δ2,…, δk}
• Ion types
{b, b-NH3, b-H2O, b-CO}
correspond to
Δ={0, 17, 18, 28}
*Note: In reality the δ value of ion type b is -1 but we will “hide” it for the sake of simplicity
46. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Match between Spectra and the
Shared Peak Count
• The match between two spectra is the number of masses (peaks)
they share (Shared Peak Count or SPC)
• In practice mass-spectrometrists use the weighted SPC that
reflects intensities of the peaks
• Match between experimental and theoretical spectra is defined
similarly
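A minimal sketch of the unweighted shared peak count, assuming integer masses and exact matching (real tools match within a mass tolerance and weight by intensity, as noted above). The "experimental" peak list here is hypothetical.

```python
def shared_peak_count(spectrum_a, spectrum_b):
    """Number of masses (peaks) present in both spectra."""
    return len(set(spectrum_a) & set(spectrum_b))

theoretical = [57, 71, 154, 185, 301, 332, 415, 429, 486]
# Hypothetical experimental spectrum: some true peaks are missing,
# and 81 and 460 are noise.
experimental = [57, 71, 81, 154, 185, 301, 460, 486]

spc = shared_peak_count(theoretical, experimental)
print(spc)  # 6
```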
47. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between
an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• Δ: set of possible ion types
Output:
• A peptide whose theoretical spectrum
matches the experimental spectrum the best
48. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
S E Q U E N C E
Mass/Charge (M/Z)
Shifting Peaks: a-ions = b-ions - CO = b-ions - 28
49. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reverse Shifts
Shift in H2O+NH3
Shift in H2O
50. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Vertices of the Spectrum Graph
• Masses of potential N-terminal peptides
• Vertices are generated by reverse shifts corresponding to ion types Δ={δ1, δ2,…, δk}
• Every N-terminal peptide can generate up to k ions m-δ1, m-δ2, …, m-δk
• Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} corresponding to potential N-terminal peptides
• Vertices of the spectrum graph: {initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
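The vertex construction can be sketched as follows, using the simplified offsets Δ = {0, 17, 18, 28} from the ion type example. The `parent_mass` parameter and the choice to discard candidates outside (0, parent_mass) are assumptions made for this sketch.

```python
# Simplified ion-type offsets from the slides: b, b-NH3, b-H2O, b-CO.
ION_OFFSETS = {"b": 0, "b-NH3": 17, "b-H2O": 18, "b-CO": 28}

def spectrum_graph_vertices(spectrum, parent_mass, offsets=ION_OFFSETS):
    """Each peak s yields candidate prefix masses s + delta (reverse shifts)."""
    vertices = {0, parent_mass}  # initial and terminal vertices
    for s in spectrum:
        for delta in offsets.values():
            v = s + delta
            if 0 < v < parent_mass:  # assumption: keep only interior masses
                vertices.add(v)
    return sorted(vertices)

# The peak at 136 could be a b-H2O ion of a prefix of mass 154 (136 + 18).
vertices = spectrum_graph_vertices([57, 136], parent_mass=486)
print(vertices)  # [0, 57, 74, 75, 85, 136, 153, 154, 164, 486]
```

Most generated vertices are spurious; scoring is what later separates real prefix masses from artifacts of the reverse shifts.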
51. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Edges of the Spectrum Graph
• Two vertices with mass difference corresponding to
an amino acid A:
• Connect with an edge labeled by A
• Gap edges corresponding to the mass of pairs of
amino acids (optional)
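A small sketch of edge construction, assuming integer masses and a reduced five-letter alphabet. A full implementation would use all 20 residues, a mass tolerance, and some handling of residues with equal integer mass; gap edges are omitted here.

```python
# Reduced alphabet for illustration only.
RESIDUE_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}

def spectrum_graph_edges(vertices, residue_mass=RESIDUE_MASS):
    """Connect u -> v with label A whenever v - u equals the mass of residue A."""
    vertex_set = set(vertices)
    edges = []
    for u in sorted(vertex_set):
        for aa, mass in residue_mass.items():
            if u + mass in vertex_set:
                edges.append((u, u + mass, aa))
    return edges

edges = spectrum_graph_edges([0, 57, 154, 301, 415, 486])
path_label = "".join(aa for _, _, aa in edges)
print(edges)
print(path_label)  # GPFNA
```

With only true prefix masses as vertices the graph is a single path; with noisy vertices it branches, which is why path scoring is needed.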
52. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Paths
• Paths in the labeled graph spell out amino
acid sequences
• There are many paths, how to find the correct
one?
• We need scoring to evaluate paths
53. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Path Score
• p(P,S) = probability that peptide P produces spectrum S = {s1, s2, …, sq}
• p(P, s) = the probability that peptide P generates a peak s
• Scoring = computing probabilities
• $p(P,S) = \prod_{s \in S} p(P,s)$
54. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peak Score
• For a position t that represents ion type δj:
$p(P, s_t) = \begin{cases} q_j, & \text{if a peak is generated at } t \\ 1 - q_j, & \text{otherwise} \end{cases}$
55. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peak Score (cont’d)
• For a position t that is not associated with an ion type:
$p_R(P, s_t) = \begin{cases} q_R, & \text{if a peak is generated at } t \\ 1 - q_R, & \text{otherwise} \end{cases}$
• qR = the probability of a noisy peak that does not correspond to any ion type
56. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Finding Optimal Paths in the Spectrum Graph
• For a given MS/MS spectrum S, find a peptide P' maximizing p(P,S) over all possible peptides P:
$p(P', S) = \max_{P} p(P, S)$
• Peptides = paths in the spectrum graph
• P' = the optimal path in the spectrum graph
57. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ions and Probabilities
• Tandem mass spectrometry is characterized
by a set of ion types {δ1,δ2,..,δk} and their
probabilities {q1,...,qk}
• δi-ions of a partial peptide are produced
independently with probabilities qi
58. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ions and Probabilities
• A peptide has all k peaks with probability $\prod_{i=1}^{k} q_i$
• and no peaks with probability $\prod_{i=1}^{k} (1 - q_i)$
• A peptide also produces "random noise" with uniform probability qR in any position.
59. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions
and penalties for missing ions.
• Example: for k=4, assume that for a partial peptide P' we only see ions δ1, δ2, δ4.
The score is calculated as:
$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{1 - q_3}{1 - q_R} \cdot \frac{q_4}{q_R}$
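The ratio test for one partial peptide can be sketched in a few lines. The probabilities below are hypothetical placeholders; in practice the q values would be learned from annotated spectra.

```python
def ratio_score(observed, q, q_r):
    """Ratio-test score for one partial peptide.

    observed[j] is True if the ion of type j was seen;
    q[j] is the probability of ion type j; q_r is the noise probability.
    """
    score = 1.0
    for seen, q_j in zip(observed, q):
        if seen:
            score *= q_j / q_r              # premium for an observed ion
        else:
            score *= (1 - q_j) / (1 - q_r)  # penalty for a missing ion
    return score

q = [0.7, 0.2, 0.3, 0.1]  # hypothetical ion-type probabilities q1..q4
q_r = 0.05                # hypothetical noise probability
# Ions 1, 2 and 4 observed, ion 3 missing, as in the example with k = 4.
score = ratio_score([True, True, False, True], q, q_r)
print(score)
```

In practice the logarithm of this ratio is often summed instead, which avoids underflow when many factors are multiplied.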
60. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring Peptides
• T: set of all positions.
• Ti = {t_δ1, t_δ2, ..., t_δk}: set of positions that represent ions of partial peptide Pi.
• A peak at position t_δj is generated with probability qj.
• R = T − ∪ Ti: set of positions that are not associated with any partial peptide (noise).
61. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Probabilistic Model
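• For a position $t_{\delta_j} \in T_i$, the probability p(t, P, S) that peptide P produces a peak at position t is:
$p(t, P, S) = \begin{cases} q_j, & \text{if a peak is generated at position } t \\ 1 - q_j, & \text{otherwise} \end{cases}$
• Similarly, for $t \in R$, the probability that P produces a random noise peak at t is:
$p_R(t) = \begin{cases} q_R, & \text{if a peak is generated at position } t \\ 1 - q_R, & \text{otherwise} \end{cases}$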
62. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Probabilistic Score
• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:
$\frac{p(P,S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})}$
63. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
De Novo vs. Database Search
[Figure: the MS/MS spectrum (m/z vs. relative abundance) interpreted by de novo sequencing and by database search]
De Novo
AVGELTK
Database Search
Database of known peptides
MDERHILNM, KLQWVCSDL,
PTYWASDL, ENQIKRSACVM,
TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA,
GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT, AVGELTK,
HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
64. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
De Novo vs. Database Search: A Paradox
• De novo algorithms are much faster, even though their
search space is much larger!
• A database search scans all peptides to find the best one.
• De novo eliminates the need to scan all peptides by
modeling the problem as a graph search.
Why not sequence de novo?
65. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Why Not Sequence De Novo?
• De novo sequencing is not very accurate:
• Less than 30% of the peptides sequenced were completely correct!
Algorithm          Amino Acid Accuracy    Whole Peptide Accuracy
Lutefisk, 1997     0.566                  0.189
SHERENGA, 1999     0.690                  0.289
Peaks, 2003        0.673                  0.246
PepNovo, 2005      0.727                  0.296
66. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Pros and Cons of de novo Sequencing
• Advantage:
• Gets the sequences that are not necessarily in the
database.
• An additional similarity search step using these sequences
may identify the related proteins in the database.
• Disadvantage:
• Requires higher quality spectra to be accurate.
67. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between
an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• Δ: set of possible ion types
Output:
• A peptide whose theoretical spectrum best matches the experimental spectrum S
68. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Identification Problem
Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
Output:
• A peptide from the database whose theoretical spectrum best matches the experimental spectrum S
69. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MS/MS Database Search
Database search in mass-spectrometry has been
successful in identification of already known proteins.
Experimental spectrum can be compared with theoretical
spectra of database peptides to find the best fit.
SEQUEST (Yates et al., 1995)
But reliable identification of modified peptides is a much more difficult problem.
70. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The dynamic nature of the proteome
• The proteome of the cell is changing
• Various extra-cellular, and other signals activate various protein pathways.
• A key mechanism of protein activation is post-translational modification (PTM)
• These pathways may lead to other genes being switched on or off
• Mass spectrometry is key to probing the proteome and detecting PTMs
71. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Post-Translational Modifications
Proteins are involved in cellular signaling and
metabolic regulation.
They are subject to a large number of biological
modifications.
Most protein sequences are post-translationally
modified and 600 types (!) of modifications of
amino acid residues are known.
72. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Examples of Post-Translational
Modification
Post-translational modifications increase the number of “letters” in
amino acid alphabet and lead to a combinatorial explosion in both
database search and de novo approaches.
73. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Masses
[Figure: masses of the N-terminal peptides (57, 154, 301, 415, 486) and C-terminal peptides (71, 185, 332, 429, 486)]
74. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Masses: Modification +16 on N
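[Figure: the same N-terminal and C-terminal peptide masses (57, 71, 154, 185, 301, 332, 415, 429, 486), with a +16 offset marked on the fragments affected by the modification]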
75. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Modified Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 154 185 301 332 415 429 486
+16 +16 +16 +16 +16
57 71 154 201 301 348 431 445 502
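The shifted masses above can be reproduced with a short sketch. It assumes the example peptide is GPFNA (inferred from the residue masses on the earlier slides) and that the +16 modification sits on N; the function and parameter names are illustrative.

```python
RESIDUE_MASS = {"G": 57, "P": 97, "F": 147, "N": 114, "A": 71}

def fragment_masses(peptide, mod_position=None, mod_offset=0):
    """Prefix and suffix masses, optionally with one modified residue.

    mod_position is a 0-based index into the peptide.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    if mod_position is not None:
        masses[mod_position] += mod_offset
    prefixes = [sum(masses[:i]) for i in range(1, len(masses) + 1)]
    suffixes = [sum(masses[i:]) for i in range(1, len(masses))]
    return sorted(set(prefixes + suffixes))

unmodified = fragment_masses("GPFNA")
modified = fragment_masses("GPFNA", mod_position=3, mod_offset=16)  # +16 on N
print(unmodified)  # [57, 71, 154, 185, 301, 332, 415, 429, 486]
print(modified)    # [57, 71, 154, 201, 301, 348, 431, 445, 502]
```

Only fragments that contain the modified residue move, so one modification shifts many peaks at once.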
76. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reconstructing Modified Peptides
Reconstruct peptide from the set of masses of fragment ions
(mass-spectrum)
57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 370 387 409 415 423 429 460 472 486
57 71 81 100 112 131 154 160 172 177 185 201 221 235 301 312 325 332 348 387 409 415 423 431 445 472 502
77. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Identification Problem Revisited
Goal: Find a peptide from the database with
maximal match between an experimental and
theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
Output:
• A peptide from the database whose theoretical spectrum best matches the experimental spectrum S
78. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Modified Peptide Identification Problem
Goal: Find a modified peptide from the database with maximal
match between an experimental and theoretical spectrum.
Input:
• S: experimental spectrum
• database of peptides
• Δ: set of possible ion types
• Parameter k (# of mutations/modifications)
Output:
• A peptide that is at most k mutations/modifications apart from a database peptide and whose theoretical spectrum best matches the experimental spectrum S
79. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Search for Modified Peptides:
Virtual Database Approach
• YFDSTDYNMAK
• 2^5 = 32 possibilities, with 2 types of modifications!
Phosphorylation?
Oxidation?
Yates et al., 1995: an exhaustive search in a virtual database of all modified peptides.
Combinatorial explosion, even for a small set of modification types.
A larger set of spurious matches must be filtered out. It's much more likely that incorrect matches will have high scores.
Problem. Extend the virtual database
approach to a large set of modifications.
80. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Restrictive vs Unrestrictive (Blind)
Search for Modified Peptides
• Restrictive search requires the researcher to
guess which modification types are present in
the sample
• How would you feel if you were allowed to perform only 10 out of 20×19 = 380 possible point mutations (with all others forbidden) while aligning two proteins?
• This is exactly what the restrictive PTM
search algorithms do.
81. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Restrictive vs Unrestrictive (Blind) Search
for Modified Peptides
• Restrictive search requires the researcher to guess
which modification types are present in the sample
• Blind search performs an unrestrictive search for all
possible modification offsets at once.
82. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Database Search:
Sequence Analysis vs. MS/MS Analysis
Sequence analysis:
similar peptides (that are a few mutations apart) have similar sequences
MS/MS analysis:
similar peptides (that are a few mutations apart) have dissimilar spectra
83. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Identification Problem: Challenge
Very similar peptides may have very different
spectra!
Goal: Define a notion of spectral similarity that
correlates well with the sequence similarity.
If peptides are a few mutations/modifications
apart, the spectral similarity between their
spectra should be high.
84. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Deficiency of the Shared Peaks Count
Shared peaks count (SPC): intuitive measure
of spectral similarity.
Problem: SPC diminishes very quickly as the
number of mutations increases.
Only a small portion of correlations between
the spectra of mutated peptides is captured
by SPC.
86. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Convolution
$S_2 \ominus S_1 = \{ s_2 - s_1 : s_1 \in S_1, s_2 \in S_2 \}$
$(S_2 \ominus S_1)(x)$: number of pairs $s_1 \in S_1, s_2 \in S_2$ with $s_2 - s_1 = x$
The shared peaks count (SPC peak): $(S_2 \ominus S_1)(0)$
87. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Elements of S2 ⊖ S1 represented as elements of a difference matrix. The elements with multiplicity >2 are colored; the elements with multiplicity =2 are circled. The SPC takes into account only the red entries.
88. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Convolution: An Example
[Figure: multiplicity (0 to 5) of each offset x in the spectral convolution, for x from -150 to 150]
89. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Comparison: Difficult Case
S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
Which of the spectra
S' = {10, 20, 30, 40, 50, 55, 65, 75, 85, 95}
or
S'' = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}
fits the spectrum S the best?
SPC: both S' and S'' have 5 peaks in common with S.
Spectral Convolution: reveals the peaks at 0 and 5.
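The claim can be checked numerically. This sketch computes the convolution as a multiset of pairwise differences, assuming integer masses; the variable names are illustrative.

```python
from collections import Counter

def spectral_convolution(s2, s1):
    """Multiplicity of every difference s2 - s1 over all pairs of peaks."""
    return Counter(b - a for b in s2 for a in s1)

S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S_prime = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]
S_double_prime = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]

conv_prime = spectral_convolution(S, S_prime)
conv_double = spectral_convolution(S, S_double_prime)

# The value at offset 0 is the shared peak count.
print(conv_prime[0], conv_prime[5])    # 5 5
print(conv_double[0], conv_double[5])  # 5 5
```

Both comparisons give the same values at offsets 0 and 5, so neither SPC nor these two convolution peaks separate the two candidates.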
90. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Comparison: Difficult Case
[Figure: spectral product of S with S', and of S with S'']
91. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Limitations of Spectral Convolution
Spectral convolution does not reveal that spectra S and S' are similar, while spectra S and S'' are not.
Clumps of shared peaks: the matching positions in S' come in clumps while the matching positions in S'' don't.
This important property was not captured by
spectral convolution.
94. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Sequence Alignment=Path in a Grid
    A R N G A L R
A   1 . . . 1 . .
R   . 1 . . . . 1
N   . . 1 . . . .
G   . . . 1 . . .
Z   . . . . . . .
A   1 . . . 1 . .
L   . . . . . 1 .
R   . 1 . . . . 1
Finding similarities between
two peptides
is equivalent to finding an optimal path in a
Manhattan-like grid (sequence alignment).
Every horizontal/vertical segment in this path
corresponds to insertion/deletion of an amino
acid.
Can we find similarities between
a spectrum and a peptide
using a similar approach (spectral alignment)?
95. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Converting Spectra into 0-1 Sequences
• Convert spectrum into a 0-1 string with 1s
corresponding to the positions of the peaks.
96. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Modified peptide
Modifications are modeled as insertions (or deletions) of blocks of zeroes
000101001010000000110000001001 Spectrum
00010100001-----00010000001001 Peptide
A modification with positive offset - inserting a block of 0s
A modification with negative offset - deleting a block of 0s
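A toy sketch of the 0-1 encoding, assuming integer masses and 1-based positions; the peak lists are made up for illustration. It shows how a positive offset appears as an inserted block of 0s.

```python
def spectrum_to_bits(peaks, length):
    """Encode a spectrum as a 0-1 string with a 1 at each peak position (1-based)."""
    bits = ["0"] * length
    for p in peaks:
        bits[p - 1] = "1"
    return "".join(bits)

# Hypothetical unmodified spectrum with peaks at positions 4, 6 and 9.
unmodified = spectrum_to_bits([4, 6, 9], length=10)
# A modification of +3 after position 6 moves the last peak from 9 to 12.
modified = spectrum_to_bits([4, 6, 12], length=13)

print(unmodified)  # 0001010010
print(modified)    # 0001010000010
```

The second string equals the first with a block of three 0s inserted after the sixth character.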
97. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectra Comparison vs. String Comparison
• Comparison of theoretical and
experimental spectra (represented as 0-1
strings) corresponds to a (somewhat
unusual) edit distance/alignment
between 0-1 strings where elementary
edit operations are insertions and
deletions of blocks of 0s
• Use sequence alignment algorithms!
99. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment Graph
[Figure: spectral alignment graph, a 0-1 grid whose top row is 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1, with labels A, B, C, D, E]
Like in alignment algorithms, every path in the spectral alignment graph represents a possible interpretation of a spectrum.
A path covering the maximal number of 1s is the "best" interpretation of the spectrum.
Vertical/horizontal segments in the optimal path correspond to modifications.
100. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment vs. Sequence Alignment
• Alignment graph with different alphabet and
scoring.
• Movement can be diagonal (matching
masses) or horizontal/vertical
(insertions/deletions corresponding to PTMs).
• At most k horizontal/vertical moves.
101. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Shifts
A = {a1 < … < an}: an ordered set of natural numbers.
A shift (i, Δ) is characterized by two parameters, the position (i) and the length (Δ).
The shift (i, Δ) transforms
{a1, …, an}
into
{a1, …, ai-1, ai+Δ, …, an+Δ}
102. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Shifts: An Example
The shift (i, Δ) transforms {a1, …, an}
into {a1, …, ai-1, ai+Δ, …, an+Δ}
e.g.
10 20 30 40 50 60 70 80 90
shift (4, -5)
10 20 30 35 45 55 65 75 85
shift (7, -3)
10 20 30 35 45 55 62 72 82
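The two shifts in this example can be replayed with a few lines; the helper name is illustrative, and positions are 1-based as on the slide.

```python
def apply_shift(a, i, delta):
    """Apply shift (i, delta): add delta to a_i, ..., a_n (i is 1-based)."""
    return a[: i - 1] + [x + delta for x in a[i - 1 :]]

a = list(range(10, 100, 10))       # 10 20 ... 90
step1 = apply_shift(a, 4, -5)      # shift (4, -5)
step2 = apply_shift(step1, 7, -3)  # shift (7, -3)

print(step1)  # [10, 20, 30, 35, 45, 55, 65, 75, 85]
print(step2)  # [10, 20, 30, 35, 45, 55, 62, 72, 82]
```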
103. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment Problem
• Find a series of k shifts that make the sets
A={a1, …, an} and B={b1, …, bm}
as similar as possible.
• k-similarity between sets
• D(k) - the maximum number of elements in
common between sets after k shifts.
104. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Comparing Spectra=Comparing 0-1 Strings
• A modification with positive offset corresponds to
inserting a block of 0s
• A modification with negative offset corresponds to
deleting a block of 0s
• Comparison of theoretical and experimental spectra
(represented as 0-1 strings) corresponds to a
(somewhat unusual) edit distance/alignment
problem where elementary edit operations are
insertions/deletions of blocks of 0s
• Use sequence alignment algorithms!
105. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Product
A={a1, …, an} and B={b1, …, bm}
Spectral product A ⊗ B: two-dimensional matrix with nm 1s corresponding to all pairs of indices (ai,bj) and remaining elements being 0s.
[Figure: spectral product matrix with columns labeled 10 20 30 40 50 55 65 75 85 95]
SPC: the number of 1s on the main diagonal.
δ-shifted SPC: the number of 1s on the diagonal (i, i+δ)
106. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment: k-similarity
k-similarity between spectra: the maximum number
of 1s on a path through this graph that uses at most
k+1 diagonals.
k-optimal spectral
alignment = a path.
The spectral alignment
allows one to detect
more and more subtle
similarities between
spectra by increasing k.
107. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Use of k-Similarity
SPC reveals only D(0)=3 matching peaks.
Spectral Alignment reveals more hidden similarities between spectra: D(1)=5 and D(2)=8 and detects corresponding mutations.
108. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
The black line represents the path for k=0
The red lines represent the path for k=1
The blue lines (right) represent the path for k=2
110. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Dynamic Programming for Spectral Alignment
Dij(k): the maximum number of 1s on a path to
(ai,bj) that uses at most k+1 diagonals.
(i’,j’) ~ (i,j): the points are located on the same diagonal
$$D_{ij}(k) = \max_{(i',j') < (i,j)} \begin{cases} D_{i'j'}(k) + 1, & \text{if } (i',j') \sim (i,j) \\ D_{i'j'}(k-1) + 1, & \text{otherwise} \end{cases}$$
$$D(k) = \max_{ij} D_{ij}(k)$$
Running time: O(n^4 k)
111. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Edit Graph for Fast Spectral Alignment
diag(i,j) – the position
of previous 1 on the
same diagonal as (i,j)
112. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Fast Spectral Alignment Algorithm
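$$D_{ij}(k) = \max \begin{cases} D_{diag(i,j)}(k) + 1 \\ M_{i-1,j-1}(k-1) + 1 \end{cases}$$
$$M_{ij}(k) = \max_{(i',j') \le (i,j)} D_{i'j'}(k)$$
$$M_{ij}(k) = \max \begin{cases} D_{ij}(k) \\ M_{i-1,j}(k) \\ M_{i,j-1}(k) \end{cases}$$
Running time: O(n^2 k)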
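The recurrences for D, M, and diag(i,j) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: spectra are sorted lists of distinct integer masses, a path may start at any 1 of the spectral product (so D(0) is the best count on any single diagonal), and the name `spectral_alignment` is hypothetical.

```python
def spectral_alignment(a, b, k_max):
    """Return [D(0), ..., D(k_max)] for sorted, distinct integer spectra a and b."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return [0] * (k_max + 1)

    # diag[i][j]: index pair of the previous 1 on the same diagonal, or None.
    diag = [[None] * m for _ in range(n)]
    last_on_diagonal = {}
    for i in range(n):
        for j in range(m):
            offset = b[j] - a[i]
            diag[i][j] = last_on_diagonal.get(offset)
            last_on_diagonal[offset] = (i, j)

    results = []
    M_prev = None  # M table for k-1 (None while k == 0)
    for k in range(k_max + 1):
        D = [[0] * m for _ in range(n)]
        M = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                best = 1  # a path consisting of (a_i, b_j) alone
                if diag[i][j] is not None:
                    pi, pj = diag[i][j]
                    best = max(best, D[pi][pj] + 1)            # stay on the diagonal
                if M_prev is not None and i > 0 and j > 0:
                    best = max(best, M_prev[i - 1][j - 1] + 1)  # switch diagonals
                D[i][j] = best
                up = M[i - 1][j] if i > 0 else 0
                left = M[i][j - 1] if j > 0 else 0
                M[i][j] = max(best, up, left)
        results.append(M[n - 1][m - 1])
        M_prev = M
    return results


a = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
b = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]
print(spectral_alignment(a, b, 2))
```

Each of the k+1 rounds fills two n-by-m tables with constant work per cell, which matches the O(n^2 k) bound once the diag table has been precomputed in a single pass.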
113. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment: Complications
Spectra are combinations of an increasing (N-
terminal ions) and a decreasing (C-terminal
ions) number series.
These series form two diagonals in the
spectral product, the main diagonal and the
perpendicular diagonal.
The described algorithm deals with the main
diagonal only.
114. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Spectral Alignment: Complications
• Simultaneous analysis of N- and C-terminal
ions
• Taking into account the intensities and
charges
• Analysis of minor ions
115. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
MS/MS and East-West Traveling Salesman Problem
116. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Anti-symmetric paths
[Figure: two copies of a spectrum graph with edges labeled by amino acids (S, E, T, D), illustrating anti-symmetric paths.]
The anti-symmetric path problem (Ting Chen, SODA 2001) avoids using the same
peak twice (as both a b-ion and a y-ion) in the optimal path.
Without the anti-symmetric condition, the correct peptide SETDE would be
missed and the algorithm would output a symmetric peptide:
ESESE
Why?
De novo problem
Input: Spectrum graph
Output: Anti-symmetric longest path
119. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration in Similarity Searches
[Figure: database search pipeline. A protein query is compared against a database of sequences to produce sequence matches. One panel scores every database sequence by sequence alignment (Smith-Waterman algorithm); the other adds a filtration step before scoring (BLAST).]
• BLAST filters out very few correct
matches and is almost as accurate as the
Smith-Waterman algorithm.
121. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration in MS/MS Sequencing
• Filtration in MS/MS is more difficult than in BLAST.
• Early approaches using Peptide Sequence Tags
could not replace a complete database search.
• Can we design a filtration-based search that can replace
the database search and is orders of magnitude faster?
122. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Asking the Old Question Again: Why
Not Sequence De Novo?
• De novo sequencing is still not very accurate!
Algorithm                             Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)   0.566                 0.189
SHERENGA (Dancik et al., 1999)        0.690                 0.289
Peaks (Ma et al., 2003)               0.673                 0.246
PepNovo (Frank and Pevzner, 2005)     0.727                 0.296
123. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
So What Can be Done with De Novo?
• Given an MS/MS spectrum:
• Can de novo predict the entire peptide sequence?
- No! (accuracy is less than 30%).
• Can de novo predict partial sequences?
- No! (accuracy is 50% for GutenTag and 80% for PepNovo)
• Can de novo predict a set of partial sequences that, with
high probability, contains at least one correct tag (a covering set of tags)?
- Yes!
124. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Peptide Sequence Tags
• A Peptide Sequence Tag is a short substring of
a peptide.
Example: G V D L K
Tags: G V D, V D L, D L K
125. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration with Peptide Sequence Tags
• Peptide sequence tags can be used as filters in
database searches.
• The Filtration: Consider only database peptides
that contain the tag (in its correct relative mass
location).
• First suggested by Mann and Wilm (1994).
• Similar concepts also used by:
• GutenTag - Tabb et al. 2003.
• MultiTag - Sunyaev et al. 2003.
• OpenSea - Searle et al. 2004.
126. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Why Filter Database Candidates?
• Filtration makes genomic database searches practical
(BLAST).
• Effective filtration can greatly speed up the process,
enabling expensive searches involving post-translational
modifications.
• Goal: generate a small set of covering tags and use them
to filter the database peptides.
127. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tag Generation - Global Tags
• Parse tags from de novo reconstruction.
• Only a small number of tags can be generated.
• If the de novo sequence is completely incorrect,
none of the tags will be correct.
[Figure: spectrum graph with edges labeled by amino acids; the de novo reconstruction is AVGELTK.]
Tag   Prefix Mass
AVG   0.0
VGE   71.0
GEL   170.1
ELT   227.1
LTK   356.2
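The tag table above can be reproduced with a short sketch. It assumes standard monoisotopic residue masses and a hypothetical helper named `global_tags`; only the residues needed for the AVGELTK example are listed.

```python
# Monoisotopic residue masses (Da) for the residues used in the example
RESIDUE_MASS = {
    "A": 71.03711, "V": 99.06841, "G": 57.02146, "E": 129.04259,
    "L": 113.08406, "T": 101.04768, "K": 128.09496,
}


def global_tags(peptide, length=3):
    """List (tag, prefix mass) for every substring of the given length."""
    tags = []
    prefix = 0.0  # total mass of the residues before the tag
    for start in range(len(peptide) - length + 1):
        tags.append((peptide[start:start + length], round(prefix, 1)))
        prefix += RESIDUE_MASS[peptide[start]]
    return tags


for tag, prefix_mass in global_tags("AVGELTK"):
    print(tag, prefix_mass)
```

The prefix mass is what lets a tag act as a filter "in its correct relative mass location": a database peptide qualifies only if the residues before the matched tag sum to that mass.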
128. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tag Generation - Local Tags
• Extract the highest-scoring
subpaths from the spectrum graph.
• Sometimes gets misled by locally
promising-looking “garden paths”.
[Figure: spectrum graph with edges labeled by amino acids; local tags are extracted from high-scoring subpaths.]
Tag   Prefix Mass
AVG   0.0
WTD   120.2
PET   211.4
129. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Ranking Tags
• Each additional tag used to filter increases the
number of database hits and slows down the
database search.
• Tags can be ranked according to their scores;
however, this ranking is not very accurate.
• It is better to determine for each tag the
“probability” that it is correct, and choose the most
probable tags.
130. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Reliability of Amino Acids in Tags
• For each amino acid in a tag we want to assign a
probability that it is correct.
• Each amino acid, which corresponds to an edge in the
spectrum graph, is mapped to a feature space that consists
of the features that correlate with reliability of amino acid
prediction, e.g. score reduction due to edge removal
131. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Score Reduction Due to Edge Removal
• The removal of an edge corresponding to a
genuine amino acid usually leads to a reduction
in the score of the de novo path.
• However, the removal of an edge that does not
correspond to a genuine amino acid tends to
leave the score unchanged.
[Figure: copies of the spectrum graph, with edges labeled by amino acids, shown with different edges removed.]
132. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Probabilities of Tags
• How do we determine the probability of a
predicted tag?
• We use the predicted probabilities of its amino
acids and follow the concept:
a chain is only as strong as its weakest link
134. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tag-based Database Search
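[Figure: tag-based search pipeline with stages De novo, Tag filter, Tag extension, Score, and Significance; the tag filter reduces the database (Db) of 55M peptides to 700 candidate peptides.]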
135. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Matching Multiple Tags
• Matching of a sequence tag against a database is
fast
• Even matching many tags against a database is
fast
• k tags can be matched against a database in time
proportional to database size, but independent of
the number of tags.
• keyword trees (Aho-Corasick algorithm)
• Scan time can be amortized by combining scans for
many spectra all at once.
• build one keyword tree from multiple spectra
136. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Keyword Trees
[Figure: keyword tree built from the tags YFAK, YFNS, and FNTA, scanned against the text …YFRAYFNTA…]
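A keyword tree with failure links (the Aho-Corasick construction) can be sketched as below. This is a minimal illustration rather than the implementation used by any particular search tool; the names `build_keyword_tree` and `find_tags` are hypothetical, and the tree is stored as parallel lists indexed by state number.

```python
from collections import deque


def build_keyword_tree(patterns):
    """Build goto edges, failure links, and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: a node's failure link is the longest proper suffix
    # of its path label that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_tags(text, patterns):
    """Return (start position, pattern) for every occurrence in text."""
    goto, fail, out = build_keyword_tree(patterns)
    hits, state = [], 0
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits.extend((pos - len(p) + 1, p) for p in out[state])
    return hits


print(find_tags("YFRAYFNTA", ["YFAK", "YFNS", "FNTA"]))
```

The text is scanned once regardless of how many tags are in the tree, which is why tags from many spectra can share a single pass over the database.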
137. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Tag Extension
[Figure: search pipeline with stages De novo, Filter, Extension, Score, and Significance; the database (Db) of 55M peptides is reduced to 700 candidate peptides.]
138. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Fast Extension
• Given:
• a tag with prefix and suffix masses, <mP> xyz <mS>
• a match of the tag xyz in the database
• Determine whether the flanking prefix and suffix match the
masses mP and mS, given the allowable modifications.
• Compute a candidate peptide with the most likely
positions of modifications (attachment points).
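The prefix half of the extension step can be sketched as follows. This is a simplified illustration under several assumptions that are not in the slides: masses are nominal integers, at most one modification is applied to the prefix, site specificity of the modification is ignored, and the function name `extend_prefix` and the example database string are hypothetical.

```python
# Nominal residue masses (Da) for the residues used in the example
NOMINAL_MASS = {"M": 131, "K": 128, "S": 87, "V": 99, "G": 57,
                "E": 129, "L": 113, "T": 101, "R": 156}


def extend_prefix(db, tag_start, prefix_mass, mods=(0,)):
    """Return (peptide start, modification mass) pairs whose residues
    between the start and the tag sum to prefix_mass."""
    hits = []
    acc = 0                            # unmodified mass of db[start:tag_start]
    limit = prefix_mass - min(mods)    # beyond this, no modification can match
    for start in range(tag_start, -1, -1):
        if start < tag_start:
            acc += NOMINAL_MASS[db[start]]
        if acc > limit:
            break
        hits.extend((start, mod) for mod in mods if acc + mod == prefix_mass)
    return hits


db = "MKSVGELTKR"            # hypothetical database sequence
tag_start = db.index("GEL")  # the tag GEL matches at position 4
print(extend_prefix(db, tag_start, 186))               # unmodified prefix SV
print(extend_prefix(db, tag_start, 266, mods=(0, 80)))  # prefix SV plus 80 Da
```

The suffix side would be handled the same way by walking rightward from the end of the tag and comparing against mS.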
139. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring Modified Peptides
[Figure: search pipeline with stages De novo, Filter, Extension, Score, and Significance, applied to a database (Db) of 55M peptides.]
140. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring
• Input:
• Candidate peptide with attached modifications
• Spectrum
• Output:
• Score function that normalizes for length, as
variable modifications can change peptide length.
141. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Assessing Reliability of Identifications
[Figure: search pipeline with stages De novo, Filter, Extension, Score, and Significance, applied to a database (Db) of 55M peptides.]
142. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Selecting Features for Separating
Correct and Incorrect Predictions
• Features:
• Score S: as computed
• Explained Intensity I: fraction of total intensity explained by
annotated peaks.
• b-y score B: fraction of b+y ions annotated
• Explained peaks P: fraction of top 25 peaks annotated.
• Each of I,S,B,P features is normalized (subtract mean
and divide by s.d.)
• Problem: separate correct and incorrect identifications
using I,S,B,P
143. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Separating power of features
144. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Separating power of features
Quality scores:
Q = w_I·I + w_S·S + w_B·B + w_P·P
The weights are chosen to minimize
the misclassification error
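The quality score can be sketched as a weighted sum of normalized features. The weights and feature values below are made up for illustration; the slides do not give the fitted weights, and fitting them to minimize misclassification error is not shown. The use of the population standard deviation and the name `quality_scores` are also assumptions.

```python
from statistics import mean, pstdev


def normalize(values):
    """Subtract the mean and divide by the standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]


def quality_scores(features, weights):
    """features: name -> raw values, one per identification.
    weights: name -> weight. Returns one Q per identification."""
    z = {name: normalize(vals) for name, vals in features.items()}
    count = len(next(iter(features.values())))
    return [sum(weights[name] * z[name][i] for name in weights)
            for i in range(count)]


# Hypothetical feature values for two identifications
features = {"I": [0.2, 0.8], "S": [1.0, 3.0], "B": [0.1, 0.5], "P": [0.2, 0.6]}
weights = {"I": 0.25, "S": 0.25, "B": 0.25, "P": 0.25}  # illustrative only
print(quality_scores(features, weights))
```

Normalizing first puts the four features on a common scale, so the weights reflect separating power rather than the units of each feature.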
145. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Distribution of Quality Scores
146. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Results on ISB data-set
• All ISB spectra were searched.
• The top match is valid for 2978 spectra (2765 for Sequest)
• InsPecT-Sequest: 644 spectra (I-S dataset)
• Sequest-InsPecT: 422 spectra (S-I dataset)
• Average explained intensity of I-S = 52%
• Average explained intensity of S-I = 28%
• Average explained intensity of I ∩ S = 58%
• ~70 Met. Oxidations
• Run time is 0.7 secs. per spectrum (2.7 secs. for Sequest)
147. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Filtration Results
The search was done against SWISS-PROT (54Mb).
• With 10 tags of length 3:
• The filtration is 1500 times more efficient.
• Less than 4% of spectra are filtered out.
• The search time per spectrum is reduced by two orders of magnitude
as compared to SEQUEST.
PTMs              Tag Length   # Tags   Filtration    InsPecT Runtime   SEQUEST Runtime
None              3            1        3.4 × 10^-7   0.17 sec          > 1 minute
None              3            10       1.6 × 10^-6   0.27 sec
Phosphorylation   3            1        5.8 × 10^-7   0.21 sec          > 2 minutes
Phosphorylation   3            10       2.7 × 10^-6   0.38 sec
148. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
SPIDER: Yet Another Application of de novo
Sequencing
• Suppose you have a good MS/MS spectrum of an
elephant peptide
• Suppose you even have a good de novo
reconstruction of this spectrum
• However, until the elephant genome is sequenced, it is
hard to verify this de novo reconstruction
• Can you search the de novo reconstruction of an elephant
peptide against a human protein database?
• The SPIDER algorithm addresses this comparative
proteomics problem
Slides from Bin Ma, University of Western Ontario
149. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Common de novo sequencing errors
• N and GG have the same mass
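A quick check of such mass coincidences, assuming standard monoisotopic residue masses and a hypothetical tolerance parameter `tol` that stands in for instrument accuracy:

```python
# Monoisotopic residue masses (Da) for the residues used in the example
MONO_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
             "L": 113.08406, "N": 114.04293, "E": 129.04259}


def mass(peptide):
    """Sum of residue masses of a peptide fragment."""
    return sum(MONO_MASS[r] for r in peptide)


def same_mass(x, y, tol=0.05):
    """True if the two fragments are indistinguishable within tol Da."""
    return abs(mass(x) - mass(y)) <= tol


print(same_mass("N", "GG", tol=0.001))  # identical composition, same mass
print(same_mass("LS", "EA"))            # equal only at coarse resolution
print(same_mass("LS", "EA", tol=0.01))  # distinguishable at higher accuracy
```

N and GG coincide to within rounding of the table values, while pairs such as LS and EA differ by about 0.04 Da and are confused only when mass accuracy is coarse.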
150. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
From de novo Reconstruction to Database
Candidate through Real Sequence
• Given a sequence with errors, search for
similar sequences in a DB.
(Seq) X: LSCFAV
(Real) Y: SLCFAV
(Match) Z: SLCF-V
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV
Differences between X and Y are sequencing errors (here mass(LS)=mass(EA));
differences between Y and Z are homology mutations.
151. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Alignment between de novo Candidate and
Database Candidate
• If real sequence Y is known then:
d(X,Z) = seqError(X,Y) + editDist(Y,Z)
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV
152. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Alignment between de novo Candidate and
Database Candidate
• If real sequence Y is known then:
d(X,Z) = seqError(X,Y) + editDist(Y,Z)
• If real sequence Y is unknown then the distance between de
novo candidate X and database candidate Z:
• d(X,Z) = minY ( seqError(X,Y) + editDist(Y,Z) )
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV
153. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Alignment between de novo Candidate and
Database Candidate
• If real sequence Y is known then:
d(X,Z) = seqError(X,Y) + editDist(Y,Z)
• If real sequence Y is unknown then the distance between de
novo candidate X and database candidate Z:
• d(X,Z) = minY ( seqError(X,Y) + editDist(Y,Z) )
• Problem: search a database for Z that minimizes d(X,Z)
• The core problem is to compute d(X,Z) for given X and Z.
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV
154. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Computing seqError(X,Y)
• Align X and Y (according to mass).
• A segment of X can be aligned to a segment of
Y only if their mass is the same!
• For each erroneous mass block (Xi,Yi), the cost is
f(Xi,Yi)=f(mass(Xi)).
• f(m) depends on how often de novo sequencing
makes errors on a segment with mass m.
• seqError(X,Y) is the sum of all f(mass(Xi)).
[Figure: X is related to Y by seqError, and Y is related to Z by editDist.]
(Seq) X: LSCFAV
(Real) Y: EACFAV
155. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Computing d(X,Z)
• Dynamic Programming:
• Let D[i,j]=d(X[1..i], Z[1..j])
• We examine the last block of the alignment of
X[1..i] and Z[1..j].
(Seq) X: LSCF-AV
(Real) Y: EACF-AV
(Match) Z: DACFKAV
156. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Dynamic Programming: Four Cases
• Cases A, B, C (no de novo sequencing errors):
D[i,j]=D[i,j-1]+indel
D[i,j]=D[i-1,j]+indel
D[i,j]=D[i-1,j-1]+dist(X[i],Z[j])
• Case D (de novo sequencing error):
D[i,j]=D[i’-1,j’-1]+alpha(X[i’..i],Z[j’..j])
• D[i,j] is the minimum of the four cases.
157. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Computing alpha(.,.)
• alpha(X[i’..i],Z[j’..j])
= min_{y: m(y)=m(X[i’..i])} [seqError(X[i’..i],y) + editDist(y,Z[j’..j])]
= min_{y: m(y)=m[i’..i]} [f(m[i’..i]) + editDist(y,Z[j’..j])]
= f(m[i’..i]) + min_{y: m(y)=m[i’..i]} editDist(y,Z[j’..j]).
• This is like aligning a mass with a string.
• Mass-alignment Problem: Given a mass m and a
peptide P, find a peptide of mass m that is most
similar to P (among all possible peptides)
158. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Solving Mass-Alignment Problem
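$$\alpha(m, Z[i..j]) = \min \begin{cases} \min_{y} \left[ \alpha(m - m(y), Z[i..(j-1)]) + dist(y, Z[j]) \right] \\ \alpha(m, Z[i..(j-1)]) + indel \\ \min_{y} \left[ \alpha(m - m(y), Z[i..j]) + indel \right] \end{cases}$$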
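The mass-alignment recurrence can be sketched with a table indexed by mass and by prefix length of Z. This is an illustrative version under assumptions that are not in the slides: masses are nominal integers, dist is 0 for a match and a fixed penalty otherwise, indel is a fixed penalty, the left end of Z is fixed, and the name `mass_alignment` is hypothetical.

```python
import math

# Nominal residue masses (Da) of the 20 standard amino acids
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def mass_alignment(m, z, indel=1, mismatch=1):
    """Minimum edit distance between z and any peptide of nominal mass m
    (math.inf if no peptide has that mass)."""
    # alpha[mm][j]: best cost of aligning mass mm with the prefix z[:j]
    alpha = [[math.inf] * (len(z) + 1) for _ in range(m + 1)]
    alpha[0][0] = 0
    for mm in range(m + 1):
        for j in range(len(z) + 1):
            if mm == 0 and j == 0:
                continue
            best = math.inf
            if j > 0:
                # z[j-1] has no counterpart in the peptide
                best = alpha[mm][j - 1] + indel
            for aa, w in NOMINAL_MASS.items():
                if w > mm:
                    continue
                # residue aa has no counterpart in z
                best = min(best, alpha[mm - w][j] + indel)
                if j > 0:
                    # residue aa is aligned with z[j-1]
                    cost = 0 if aa == z[j - 1] else mismatch
                    best = min(best, alpha[mm - w][j - 1] + cost)
            alpha[mm][j] = best
    return alpha[m][len(z)]


print(mass_alignment(200, "DA"))  # mass of EA (or LS) aligned against DA
```

Because every residue mass is positive, each entry depends only on entries with a smaller mass or a shorter prefix, so a single pass in increasing order of mass and prefix length fills the table.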
159. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Improving the Efficiency
• Homology Match mode:
• Assumes tagging (only peptides that share a tag of
length 3 with de novo reconstruction are considered)
and extension of found hits by dynamic programming
around the hits.
• Non-gapped homology match mode:
• Sequencing error and homology mutations do not
overlap.
• Segment Match mode:
• No homology mutations.
• Exact Match mode:
• No sequencing errors or homology mutations.
160. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Experiment Result
• The correct peptide sequence for each spectrum is known.
• The proteins are all in Swissprot but not in the human
database.
• SPIDER searches 144 spectra against both Swissprot and
human databases
161. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Example
• Using the de novo reconstruction X=CCQWDAEACAFNNPGK,
the homolog Z was found in the human database. At the same
time, the correct sequence Y was found in the SwissProt
database.
Seq(X): CCQ[W ]DAEAC[AF]<NN><PG>K
Real(Y): CCK AD DAEAC FA VE GP K
Database(Z): CCK[AD]DKETC[FA]<EE><GK>K
sequencing errors
homology mutations