This document provides an overview of databases, definitions, scoring matrices, and pairwise sequence alignment. It discusses major bioinformatics databases like NCBI, ExPASy, and EBI. It also defines key terms like identity, homology, orthologous, and paralogous sequences. Additionally, it examines the theoretical and empirical bases for scoring matrices like PAM, BLOSUM, and transition/transversion matrices, and how they are used in sequence alignment.
This document provides an overview of topics to be covered in a bioinformatics course, including biological databases, sequence similarity scoring matrices, sequence alignments, database searching, phylogenetics, protein structure, gene prediction, and other topics. A schedule is given listing the topics and dates. Background information is also provided on definitions, major bioinformatics databases, scoring matrices, and sequence alignments.
A scoring system is a set of values that quantifies the likelihood of one residue being substituted by another in an alignment.
It is also known as a substitution matrix.
A scoring matrix for nucleotides is relatively simple.
A positive value or high score is given for a match, and a negative value or low score is given for a mismatch.
Scoring matrices for amino acids are more complicated because scoring has to reflect the physicochemical properties of amino acid residues.
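The nucleotide case described above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular tool's scheme: the function names and the default values (+1 for a match, -1 for a mismatch) are assumptions chosen for the example, and the two sequences are assumed to be already aligned without gaps.

```python
def nucleotide_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Score one aligned pair of nucleotides (default values are illustrative)."""
    return match if a.upper() == b.upper() else mismatch


def score_ungapped(seq1: str, seq2: str, match: int = 1, mismatch: int = -1) -> int:
    """Sum the pair scores over two equal-length, already-aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(nucleotide_score(a, b, match, mismatch) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    # Three matches and one mismatch under the illustrative defaults.
    print(score_ungapped("ACGT", "ACGA"))
```

An amino acid matrix would replace the single match/mismatch rule with a lookup table holding one value per residue pair.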
Scoring schemes in bioinformatics (blosum) - SumatiHajela
This document discusses scoring schemes in bioinformatics, specifically BLOSUM (BLOcks SUbstitution Matrix). It introduces BLOSUM, describing that it is based on conserved amino acid patterns from multiple sequence alignments. It then explains the BLOSUM-62 matrix and the BLOSUM scoring algorithm. The document contrasts BLOSUM with PAM matrices, noting key differences like BLOSUM being based on direct observations while PAM uses evolutionary modeling. Finally, it outlines the significance of scoring matrices for detecting distant evolutionary relationships between protein sequences.
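As a rough illustration of how BLOSUM-style scores follow from direct observations, the sketch below turns counts of aligned residue pairs into rounded log-odds scores in half-bit units. The function name and the toy two-letter alphabet are hypothetical, and the counts are invented purely to show the arithmetic; real BLOSUM matrices are built from clustered blocks of aligned protein segments, a step this sketch omits.

```python
import math
from collections import defaultdict


def blosum_like_scores(pair_counts):
    """Convert unordered pair counts into rounded log-odds scores (half-bit units).

    Assumes each unordered pair appears once as a key, e.g. ("A", "B") but not
    also ("B", "A"), and that every count is greater than zero.
    """
    total = sum(pair_counts.values())
    # Observed frequency of each unordered residue pair.
    q = {tuple(sorted(pair)): count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    p = defaultdict(float)
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected frequency if residues paired at random.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


if __name__ == "__main__":
    toy_counts = {("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2}  # invented data
    print(blosum_like_scores(toy_counts))
```

Pairs seen more often than chance receive positive scores and pairs seen less often receive negative ones, which is the sense in which the matrix rests on observation rather than on an explicit evolutionary model.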
This document discusses sequence alignment, which involves placing two or more biological sequences in an optimal alignment to identify regions of similarity and deduce evolutionary relationships. It defines key terms like similarity, identity, conservation, and optimal alignment. It also describes the rationale for alignment, which is to compare sequences and find similarities that provide insights into biological function and evolutionary history. Finally, it outlines different types of alignment like global, local, pairwise, and multiple sequence alignment.
Introduction to sequence alignment part II - SumatiHajela
This document provides an introduction to sequence alignment and discusses gaps and gap penalties. It defines matches and gaps in a sequence alignment and explains how substitutions, deletions, and insertions are represented. It describes different types of gap penalties, including constant, linear, affine, convex, and profile-based variable penalties. It notes that gaps allow an alignment to be extended but introduce uncertainty, which is why they are penalized. Examples demonstrate assigning regular and affine gap penalties.
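The difference between constant, linear, and affine penalties is easiest to see as three small functions of gap length. The sketch below is an assumption-laden illustration: the function names and default costs are invented for the example, and the affine form shown (an opening cost for the first position plus an extension cost for each further position) is only one of the conventions in use.

```python
def constant_gap_penalty(length: int, cost: float = 5.0) -> float:
    """Same cost for any gap, regardless of its length."""
    return cost if length > 0 else 0.0


def linear_gap_penalty(length: int, per_position: float = 2.0) -> float:
    """Cost grows in direct proportion to gap length."""
    return per_position * max(length, 0)


def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Opening cost for the first gap position, smaller cost for each extension.

    Convention assumed here: open + extend * (length - 1).
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, constant_gap_penalty(n), linear_gap_penalty(n), affine_gap_penalty(n))
```

With these illustrative values, one long gap is cheaper under the affine scheme than several short gaps of the same total length, which matches the intuition that a single insertion or deletion event can span many residues.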
This document discusses various methods used in bioinformatics, including dot matrix analysis and dynamic programming algorithms. Dot matrix analysis can be used to compare DNA or protein sequences visually and look for repeats or alignments. Dynamic programming algorithms such as Needleman-Wunsch and Smith-Waterman are commonly used to generate optimal global and local sequence alignments by calculating alignment scores based on substitution matrices like PAM and BLOSUM.
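A dot matrix comparison can be sketched as a grid of booleans, one cell per pair of positions. The function names, the window and threshold parameters, and the text rendering below are all assumptions made for illustration; published dot plot tools differ in how they filter noise.

```python
def dot_matrix(seq1: str, seq2: str, window: int = 1, threshold: int = 1):
    """Return a grid where cell [i][j] is True if the windows starting at
    seq1[i] and seq2[j] share at least `threshold` identical positions."""
    rows = len(seq1) - window + 1
    cols = len(seq2) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            row.append(matches >= threshold)
        matrix.append(row)
    return matrix


def render(matrix) -> str:
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


if __name__ == "__main__":
    print(render(dot_matrix("GATTACA", "GATTTACA")))
```

Runs of dots along a diagonal indicate matching regions, and parallel diagonals hint at repeats. Raising the window and threshold suppresses isolated chance matches.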
1) This document introduces methods for detecting sequence similarity, which is a fundamental analysis in bioinformatics.
2) It describes how to search databases for similar sequences using BLAST or FASTA, and how to compare two sequences using dynamic programming algorithms like Needleman-Wunsch or Smith-Waterman.
3) Substitution matrices like BLOSUM62 are used to score alignments and measure sequence similarity based on amino acid properties.
This document discusses sequence alignment, which is important for predicting function, database searching, gene finding, and studying sequence divergence. It describes global and local alignment, and algorithms like Needleman-Wunsch, Smith-Waterman, and BLAST that are used for sequence alignment. Sequence alignment finds the best match between sequences and can provide information about molecular evolution by identifying mutations, insertions, and deletions.
Sequence Alignment: Pairwise alignment - naveed ul mushtaq
Sequence alignment; pairwise alignment (global alignment and local alignment); two types of alignment; progressive programs for multiple sequence alignment; BLOSUM; point accepted mutation (PAM); PAM vs. BLOSUM.
This study aims to refine existing statistical potentials for modeling RNA-protein interactions using coarse-grained approaches. The researchers applied existing potentials to a lattice model of an RNA-protein complex and identified the native structure within the lowest energy structures. They then derived new potentials from BLAST alignments of RNA sequences and found them to be largely consistent with the original potentials. The researchers propose exploring a larger dataset of RNA-protein complexes and different normalization methods to further improve the potentials for computational modeling of these interactions.
The document discusses sequence similarity searching and comparison. It describes how programs like BLAST and FASTA are used to rapidly identify similarities between sequences and determine evolutionary relationships. BLAST and FASTA utilize word matching and heuristics to efficiently search large databases and return local or global alignments with scoring of matches. They provide a powerful method for function prediction by comparing new sequences to known genes and proteins.
This document discusses multiple sequence alignment techniques. It begins with definitions of key terms like homology, similarity, and conservation. It then describes pairwise alignment and its applications. The rest of the document focuses on multiple sequence alignment methods like progressive alignment, iterative refinement, tree alignment, star alignment, and using genetic algorithms. It provides examples and explanations of popular multiple sequence alignment tools like Clustal W and T-Coffee.
Sequence homology search and multiple sequence alignment (1) - AnkitTiwari354
Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal (or lateral) gene transfer event (xenologs).[1]
Homology among DNA, RNA, or proteins is typically inferred from their nucleotide or amino acid sequence similarity. Significant similarity is strong evidence that two sequences are related by evolutionary changes from a common ancestral sequence. Alignments of multiple sequences are used to indicate which regions of each sequence are homologous.
This document discusses various sequence alignment methods. It describes global alignment, which aligns two generally similar sequences over their entire length. Local alignment finds local regions of highest similarity between more divergent sequences. Pairwise alignment compares two sequences, while multiple sequence alignment handles three or more sequences using more sophisticated methods like progressive alignment and iterative alignment. Various online tools for sequence alignment are also mentioned, including BLAST, FASTA, and CLUSTAL Omega.
Sequence alignment involves arranging DNA, RNA, or protein sequences to identify regions of similarity. It is used to determine if sequences are evolutionarily related, observe patterns of conservation, and find similar regions within proteins. The key steps are representation of sequences in a matrix, insertion of gaps, and use of scoring schemes like PAM and BLOSUM matrices to identify the best alignment. Global alignment forces alignment over full sequence lengths while local alignment identifies short, well-matching segments. Algorithms like Needleman-Wunsch and Smith-Waterman use dynamic programming to calculate optimal pairwise sequence alignments.
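The dynamic programming idea behind Needleman-Wunsch can be sketched as follows. This is a simplified global alignment with a linear gap penalty and a plain match/mismatch rule instead of a PAM or BLOSUM matrix; the function name and the default scores are assumptions chosen for the example.

```python
def needleman_wunsch(seq1: str, seq2: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_seq1, aligned_seq2). Scoring values are illustrative.
    """
    n, m = len(seq1), len(seq2)

    def pair(i: int, j: int) -> int:
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # score[i][j] holds the best score for aligning seq1[:i] with seq2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align two residues
                score[i - 1][j] + gap,             # gap in seq2
                score[i][j - 1] + gap,             # gap in seq1
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(seq2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, a, b = needleman_wunsch("ACGT", "AGT")
    print(total)
    print(a)
    print(b)
```

A Smith-Waterman variant would clamp each cell at zero and start the traceback from the highest-scoring cell, which yields a local rather than a global alignment.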
Phylogenetic prediction - maximum parsimony method - Afnan Zuiter
This document discusses the maximum parsimony method for phylogenetic prediction. It minimizes the number of evolutionary changes needed to produce observed genetic variations between sequences. A multiple sequence alignment is required to identify corresponding positions, which are analyzed to find the tree(s) requiring the fewest evolutionary changes overall. It is best for small numbers of similar sequences but becomes computationally intensive with many diverse sequences. Software like PAUP automates maximum parsimony analysis.
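Counting the changes a given tree requires can be sketched with Fitch's small-parsimony procedure, which is used here as a convenient stand-in; the summary does not say which counting method the slides or PAUP apply. The function names, the nested-tuple tree encoding, and the four toy sequences are all hypothetical.

```python
def fitch_changes(tree, states) -> int:
    """Minimum number of changes at one alignment column on a rooted binary tree.

    `tree` is a nested tuple of leaf names, e.g. (("A", "B"), ("C", "D"));
    `states` maps each leaf name to its character at this column.
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # No shared state: at least one change is needed at this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree, alignment) -> int:
    """Sum the per-column change counts over a gap-free alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    )


if __name__ == "__main__":
    toy = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACT"}  # invented data
    for candidate in ((("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))):
        print(candidate, parsimony_score(candidate, toy))
```

A full parsimony search repeats this scoring over many candidate trees and keeps those with the lowest total, which is where the computational cost grows quickly as sequences are added.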
This document provides an overview of multiple sequence alignment (MSA). MSA is used to align biological sequences, such as DNA, RNA, or proteins, to find similarities and differences between sequences. The document outlines the goal of MSA as finding structural, functional, and evolutionary relationships. It describes the general considerations and steps of MSA, including pairwise comparison of sequences, cluster analysis to generate a hierarchy, and progressive alignment. Common MSA software and applications are also summarized.
This document discusses various topics related to transcriptomics and lexico-syntactic analysis including:
- The challenges of resolving gene names and identifiers from multiple databases.
- Using text mining of biomedical abstracts to identify regulatory relationships between genes/proteins by parsing sentences and identifying relationships.
- Integrating gene expression data from multiple experiments and species to infer functional links between genes based on correlations in expression profiles.
- Other related resources discussed include the STRING database for predicting protein-protein interactions and the use of gene synonyms to integrate different types of biological data.
Increasingly Accurate Representation of Biochemistry (v2) - Michel Dumontier
Biochemical ontologies aim to capture and represent biochemical entities and the relations that exist between them in an accurate manner. A fundamental starting point is biochemical identity, but our current approach for generating identifiers is haphazard and consequently integrating data is error-prone. I will discuss plausible structure-based strategies for biochemical identity whether it be at molecular level or some part thereof (e.g. residues, collection of residues, atoms, collection of atoms, functional groups) such that identifiers may be generated in an automatic and curator/database independent manner. With structure-based identifiers in hand, we will be in a position to more accurately capture context-specific biochemical knowledge, such as how a set of residues in a binding site are involved in a chemical reaction including the fact that a key nitrogen atom must first be de-protonated. Thus, our current representation of biochemical knowledge may improve such that manual and automatic methods of bio-curation are substantially more accurate.
This study aims to create a fruit fly model with loss of the TMEM184b gene to investigate its role in neural health and axon degeneration. Researchers isolated genomic DNA from wild type flies and used PCR and restriction enzymes to amplify and digest regions surrounding the TMEM184b gene. This DNA will be ligated into a donor plasmid and injected into fly embryos to remove TMEM184b from the genome. Preliminary results showed successful cloning of the downstream homology arm and amplification of potential upstream arms. Future work will insert the amplified region into the targeting vector and inject constructs into flies to study the effects of TMEM184b loss on axon integrity after injury and synaptic structure.
Multiple sequence alignment (MSA) aligns three or more biological sequences, like proteins or nucleic acids, to infer homology and evolutionary relationships. There are three main methods: dynamic programming, which computes an optimal alignment but has high runtime; progressive alignment, which first does pairwise alignments and then adds sequences; and iterative alignment, which successively improves approximations without pairwise alignments. Popular tools for MSA include Clustal W, MAFFT, MUSCLE, and T-Coffee. MSA helps detect similarities, conserved motifs, and structural homologies between sequences.
This document discusses various methods for predicting genes and analyzing unknown DNA sequences, including:
- Using profiles, patterns, and hidden Markov models (HMMs) to find conserved sequences and predict protein function
- Ontologies like Gene Ontology that organize genes and gene products in a structured network to facilitate annotation and analysis
- Computational tools like Genefinder and Glimmer that use signals like coding potential, open reading frames, start/stop codons, and sequence similarity to known genes to predict gene structures in sequences
- Integrating multiple lines of evidence, like HMMs, EST alignments, repeats, and CpG islands, can improve gene prediction over a single method.
The document discusses multiple sequence alignment methods. It describes ClustalW, a commonly used progressive alignment method that first performs pairwise alignments of sequences and constructs a guide tree before progressively aligning sequences based on the tree. ClustalW is fast but has limitations as it is a heuristic that may not find the optimal alignment and provides no way to quantify alignment accuracy.
The document discusses using proteomics to develop vaccines. It describes how proteomics can help understand protein interactions for vaccine development. The document then focuses on developing a vaccine for Lassa fever. It outlines computational methods used to analyze the Lassa virus glycoprotein, including determining its structure, domains, and interactions within cells. The goal is to use this analysis to develop a stabilized vaccine candidate against Lassa virus that can protect humans.
This document describes a new method for specifically regulating genes between bacterial species using the CRISPR interference (CRISPRi) system. Researchers engineered a CRISPRi system on a conjugative plasmid to target and repress a fluorescent reporter gene (mRFP) in a recipient E. coli strain. The CRISPRi plasmid was transferred from a donor E. coli strain to the recipient strain through bacterial conjugation. When induced in the recipient, the CRISPRi system specifically repressed mRFP expression by 330-fold without affecting expression of another fluorescent reporter (sfGFP), demonstrating targeted gene regulation between bacterial cells via a natural horizontal gene transfer mechanism.
The document provides an overview of sequence alignment concepts including:
- Definitions of terms like identity, homology, orthologous, and paralogous genes
- Examples and explanations of scoring matrices used for nucleotide and protein sequence alignments like BLOSUM and PAM matrices
- An example multiple sequence alignment of glyceraldehyde-3-phosphate dehydrogenases from different species
- Descriptions of how scoring matrices are used to quantify sequence similarity and their importance in sequence analysis
This document provides an overview of sequence alignment and scoring matrices. It defines key terms like identity, homology, orthologous, and paralogous genes. It discusses different types of scoring matrices, including unitary matrices that score matches as 1 and mismatches as 0, and transition/transversion matrices that account for the different likelihood of transition vs. transversion mutations in DNA. It explains that scoring matrices represent implicit models of evolution and influence sequence analysis outcomes. The document emphasizes that results depend critically on the chosen scoring matrix and model.
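The contrast between a unitary matrix and a transition/transversion matrix can be sketched as two small lookup tables. The unitary values (1 for a match, 0 for a mismatch) follow the description above; the transition and transversion values, and all function names, are assumptions chosen only to show that transversions are penalized more heavily.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = "ACGT"


def unitary_matrix():
    """Match scores 1, any mismatch scores 0."""
    return {a: {b: 1 if a == b else 0 for b in BASES} for a in BASES}


def ts_tv_matrix(match: int = 1, transition: int = -1, transversion: int = -2):
    """Mismatches within a chemical class (A<->G, C<->T) are transitions;
    mismatches across classes are transversions. Values are illustrative."""
    def score(a: str, b: str) -> int:
        if a == b:
            return match
        same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
        return transition if same_class else transversion

    return {a: {b: score(a, b) for b in BASES} for a in BASES}


if __name__ == "__main__":
    matrix = ts_tv_matrix()
    print("  " + "  ".join(BASES))
    for a in BASES:
        print(a, " ".join(f"{matrix[a][b]:2d}" for b in BASES))
```

Swapping one table for the other changes which alignments score best, which is the sense in which each matrix encodes an implicit model of evolution.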
This document provides an overview of sequence alignment and scoring matrices. It defines key terms like identity, homology, orthologous, and paralogous genes. It discusses different types of scoring matrices including unitary matrices that score matches as 1 and mismatches as 0, and transition/transversion matrices that account for the higher likelihood of transitional mutations in nucleic acids. The document emphasizes that scoring matrices represent underlying evolutionary models and influence sequence analysis outcomes.
This document discusses sequence alignment and contains four sections:
1) Global alignment which finds the highest scoring alignment between entire sequences using dynamic programming.
2) Scoring matrices which generalize alignment scoring by assigning scores to individual character matches/mismatches based on biological evidence.
3) Local alignment which finds the best scoring alignment between substrings of sequences to identify conserved regions, as global alignment may miss these.
4) Ways to solve the local alignment problem efficiently in quadratic time instead of quartic time by computing alignments from each vertex in the grid.
1. The document discusses various bioinformatics concepts and tools including sequence alignment, BLAST, substitution matrices, and open reading frames. Sequence alignment involves comparing sequences to find similar regions and can be local or global. BLAST is a tool used to find similar sequences in a database by searching for exact and similar matches. Substitution matrices like BLOSUM and PAM assign scores to amino acid substitutions observed in protein evolution. Open reading frames refer to the three possible frames for translating a nucleic acid sequence into a protein.
The document discusses bioinformatics and computational biology. It describes a lab with over 100 people from diverse backgrounds, including engineers, scientists, technicians, geneticists and clinicians. The lab applies information technology to analyze biological data, focusing on areas like sequence analysis, molecular modeling, phylogeny, medical applications, statistics and more. Specific applications mentioned include analyzing genomes to study genetic diseases and drug design, as well as using the same techniques in agriculture and animal health.
The document discusses pairwise sequence alignment methods. It defines key concepts like homology and orthology. It explains that dynamic programming is used to find optimal alignments through building a score matrix and backtracking. Global alignment finds the best match over full sequences while local alignment identifies regions of local similarity. Scoring systems like PAM matrices assign values based on substitutions and penalties for gaps.
BLAST and FASTA are algorithms for searching sequence databases to find local alignments between a query sequence and database sequences, with BLAST providing faster searches and improved statistical analysis compared to FASTA. Both algorithms work by first identifying short exact matches between sequences and then extending these matches to identify longer regions of similarity. The algorithms model DNA and protein sequence alignments as coin tosses to determine the expected length of the longest matching region between random sequences.
This document discusses using stochastic models to understand principles of gene regulation from regulatory DNA architecture. It summarizes that regulating promoters downstream of irreversible assembly steps reduces molecular noise compared to regulating initiation rates. Distributed binding sites across enhancers also helps reduce expression noise. The document proposes relating complex biochemical architectures of promoters and enhancers to their transcriptional properties using finite Markov chain approaches.
The document discusses:
1) An overview of bioinformatics lessons including introductions to databases, scoring matrices, and pairwise sequence alignment.
2) Descriptions of major bioinformatics databases and resources including NCBI, ExPASy, and EBI.
3) The importance of scoring matrices in sequence analysis and how the choice of matrix can influence outcomes. Matrices are discussed for nucleotides and proteins.
Molecular docking and simulation can be used as tools in drug discovery for the renin-angiotensin system. Docking aims to characterize the binding site, orient ligands into the site, and evaluate the strength of interaction. Molecular dynamics simulation allows studying the motional properties of proteins like renin. Docking of piperidine-containing compounds with renin showed hydrogen bonding interactions between the ligands and binding site residues like Ser230, Asp38, Gly228, Tyr20 that stabilize binding. ADME/toxicity prediction and further docking/evaluation can aid in developing renin inhibitors.
5. Major sites
NCBI - The National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov/
The National Center for Biotechnology Information (NCBI) at
the National Library of Medicine (NLM), a part of the National
Institutes of Health (NIH).
ExPASy - Molecular Biology Server
http://expasy.hcuge.ch/www/
Molecular biology WWW server of the Swiss Institute of
Bioinformatics (SIB). This server is dedicated to the analysis of
protein sequences and structures as well as 2-D PAGE.
EBI - European Bioinformatics Institute
http://www.ebi.ac.uk/
19. Identity
The extent to which two (nucleotide or amino acid)
sequences are invariant.
Homology
Similarity attributed to descent from a common ancestor.
Definitions
RBP: 26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
+ K ++ + + GTW++MA+ L + A V T + +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
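Identity can be made concrete as the share of aligned columns in which both sequences carry the same residue. The sketch below is an illustration only and is not part of the original slides: the function name `percent_identity` is hypothetical, and dividing by the number of gap-free columns is an assumption (other conventions divide by the full alignment length or by the length of the shorter sequence).

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free aligned columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0   # columns where neither sequence has a gap
    identical = 0  # columns where both residues are the same
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# The two aligned rows of the RBP / glycodelin example above.
rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD-"
glycodelin = "QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN"
print(f"{percent_identity(rbp, glycodelin):.1f}% identity")
```

Identity is a measurement on the alignment; homology is an inference about common ancestry drawn from it, which is why the two terms are defined separately.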
20. Orthologous
Homologous sequences in different species
that arose from a common ancestral gene
during speciation; may or may not be responsible
for a similar function.
Paralogous
Homologous sequences within a single species
that arose by gene duplication.
Definitions
23. The power of sequence alignments
• Empirical finding: if two biological
sequences are sufficiently similar, they almost
invariably have similar biological
functions and are descended from a
common ancestor.
• This implies that (i) function is encoded in
the sequence, i.e. the sequence
provides the syntax, and
• (ii) the encoding is redundant:
many positions in the
sequence may be changed without
perceptible changes in the function, so
the semantics of the encoding is robust.
25. A metric …
It is very important to realize that all
subsequent results depend critically on just
how similarity is measured and on what model
lies at the basis of the construction of a
specific scoring matrix.
A scoring matrix is a tool to quantify how
well a certain model is represented in the
alignment of two sequences, and any result
obtained by its application is meaningful
exclusively in the context of that model.
26. Scoring matrices appear in all analyses
involving sequence comparison.
The choice of matrix can strongly influence
the outcome of the analysis.
Scoring matrices implicitly represent a
particular theory of evolution.
Understanding the theory underlying a given
scoring matrix can aid in making a proper
choice.
• Nucleic acid and Protein Scoring Matrices
Importance of scoring matrices
27. • Identity matrix (similarity)
    A  T  C  G
A   1  0  0  0
T   0  1  0  0
C   0  0  1  0
G   0  0  0  1
• BLAST matrix (similarity)
    A  T  C  G
A   5 -4 -4 -4
T  -4  5 -4 -4
C  -4 -4  5 -4
G  -4 -4 -4  5
• Transition/Transversion Matrix
    A  T  C  G
A   0  5  5  1
T   5  0  1  5
C   5  1  0  5
G   1  5  5  0
Nucleic Acid Scoring Matrices
[Figure: base pairs G and C (purine-pyrimidine) and A and T (purine-pyrimidine)]
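The identity and BLAST matrices above differ only in the values given to a match and a mismatch, so both can be generated from two numbers. The following is a minimal sketch, not taken from the slides: the names `make_matrix` and `score_ungapped` are hypothetical, and the example sequences are made up for illustration.

```python
BASES = "ATCG"


def make_matrix(match: int, mismatch: int) -> dict:
    """Build a nucleotide similarity matrix keyed by (base, base) pairs."""
    return {(a, b): (match if a == b else mismatch) for a in BASES for b in BASES}


IDENTITY = make_matrix(1, 0)   # identity matrix from the slide
BLAST = make_matrix(5, -4)     # BLAST matrix from the slide


def score_ungapped(seq_a: str, seq_b: str, matrix: dict) -> int:
    """Sum the matrix scores column by column for two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


# Made-up example: 5 matching positions and 1 mismatch (G vs A at position 4).
a, b = "ATCGGA", "ATCAGA"
print(score_ungapped(a, b, IDENTITY))  # 5 matches * 1 = 5
print(score_ungapped(a, b, BLAST))     # 5 * 5 + 1 * (-4) = 21
```

The same alignment scores differently under each matrix, which is the practical meaning of the statement that the choice of matrix influences the outcome.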
28. • Nucleotide bases fall into two
categories depending on the ring
structure of the base. Purines
(adenine and guanine) are two-ring
bases; pyrimidines (cytosine and
thymine) are single-ring bases.
Point mutations (substitutions) in DNA
are changes in which one base is
replaced by another.
• A mutation that conserves the ring
number is called a transition (e.g., A
-> G or C -> T); a mutation that
changes the ring number is called a
transversion (e.g., A -> C or A -> T).
A T C G
A 0 5 5 1
T 5 0 1 5
C 5 1 0 5
G 1 5 5 0
Transition/Transversion Matrix
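The matrix above follows directly from the purine/pyrimidine rule. As a sketch (the function names are hypothetical; the costs 0, 1 and 5 are read off the slide's matrix, which behaves as a cost or distance table rather than a similarity table), it can be rebuilt like this:

```python
PURINES = {"A", "G"}       # two-ring bases
PYRIMIDINES = {"C", "T"}   # single-ring bases

# Costs read off the slide's matrix: identical 0, transition 1, transversion 5.
COSTS = {"identity": 0, "transition": 1, "transversion": 5}


def substitution_type(a: str, b: str) -> str:
    """Classify a base pair as identity, transition or transversion."""
    if a not in PURINES | PYRIMIDINES or b not in PURINES | PYRIMIDINES:
        raise ValueError(f"unexpected base pair: {a!r}, {b!r}")
    if a == b:
        return "identity"
    same_ring_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_ring_class else "transversion"


def build_cost_matrix(order: str = "ATCG") -> list:
    """Return the cost matrix as nested lists, rows and columns in `order`."""
    return [[COSTS[substitution_type(r, c)] for c in order] for r in order]


for base, row in zip("ATCG", build_cost_matrix()):
    print(base, *row)
```

With the row and column order A, T, C, G this reproduces the four rows shown on the slide.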
29. • Although there are more ways to
create a transversion, the number
of transitions observed to occur in
nature (i.e., when comparing
related DNA sequences) is much
greater. Since the likelihood of
transitions is greater, it is
sometimes desirable to create a
weight matrix that takes this
propensity into account when
comparing two DNA sequences.
• Use of a Transition/Transversion
Matrix reduces noise in
comparisons of distantly related
sequences.
Transition/Transversion Matrix
A T C G
A 0 5 5 1
T 5 0 1 5
C 5 1 0 5
G 1 5 5 0
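To see what the weighting changes, the sketch below compares a plain mismatch count with the weighted cost from the matrix above. The sequences are invented for illustration and the function names are hypothetical; only the matrix values come from the slide.

```python
# Transition/transversion cost matrix as shown on the slide.
TS_TV_COST = {
    "A": {"A": 0, "T": 5, "C": 5, "G": 1},
    "T": {"A": 5, "T": 0, "C": 1, "G": 5},
    "C": {"A": 5, "T": 1, "C": 0, "G": 5},
    "G": {"A": 1, "T": 5, "C": 5, "G": 0},
}


def mismatch_count(a: str, b: str) -> int:
    """Unweighted number of differing positions."""
    return sum(x != y for x, y in zip(a, b))


def weighted_cost(a: str, b: str) -> int:
    """Sum of transition/transversion costs over all positions."""
    return sum(TS_TV_COST[x][y] for x, y in zip(a, b))


reference = "ACGTACGT"
only_transitions = "GCGTATGT"    # A->G at position 1, C->T at position 6
only_transversions = "CCGTAAGT"  # A->C at position 1, C->A at position 6

for name, seq in [("transitions", only_transitions),
                  ("transversions", only_transversions)]:
    print(name, mismatch_count(reference, seq), weighted_cost(reference, seq))
```

Both variants differ from the reference at two positions, but the weighted cost is 2 for the transition-only variant and 10 for the transversion-only one, so the frequent, less informative changes contribute less to the comparison.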
30. • The simplest metric in use is the
identity metric.
• If two amino acids are the
same, they are given one score; if
they are not, they are given a
different score, regardless of what
the replacement is.
• One may give a score of 1 for
matches and 0 for mismatches; this
leads to the frequently used unitary
matrix.
Protein Scoring Matrices: Unitary Matrix
32. Protein Scoring Matrices: Unitary Matrix
• The simplest matrix:
– High scores for Identities
– Low scores for non-identities
• Works for closely related proteins
• Alternatively, one could assign +6 for a match and -1 for
a mismatch. Such a matrix is useful for
local alignment procedures, where a negative
expected score for randomly aligned
sequences is required to ensure that the score
does not grow simply from extending the
alignment in a random way.
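The requirement of a negative expected score can be checked with a short calculation. The sketch below assumes, purely for illustration, that all 20 amino acids are equally frequent (real proteins are not uniform) and that aligned positions are independent; the function name `expected_score` is hypothetical.

```python
def expected_score(match: float, mismatch: float, frequencies: dict) -> float:
    """Expected per-column score when two random residues are aligned.

    The probability that two independently drawn residues are identical
    is the sum of the squared residue frequencies.
    """
    p_match = sum(f * f for f in frequencies.values())
    return match * p_match + mismatch * (1.0 - p_match)


AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption for illustration: uniform background frequencies of 1/20 each.
uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

print(expected_score(1, 0, uniform))    # about +0.05: random extension gains score
print(expected_score(6, -1, uniform))   # about -0.65: random extension loses score
```

Under these assumptions the 1/0 unitary matrix has a small positive expectation, so a local alignment could grow by chance, while the +6/-1 matrix has a negative expectation, as the slide requires.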
33. A very crude model of an evolutionary
relationship could be implemented in a
scoring matrix in the following way: since
all point mutations arise from nucleotide
changes, the probability that an observed
amino acid pair is related by chance
rather than by inheritance should depend on
the number of point mutations necessary
to transform one codon into the other.
A metric resulting from this model would
define the distance between two amino
acids as the minimal number of nucleotide
changes required.
Genetic Code Matrix
34.
        A S G L K V T P E D N I Q R F Y C H M W Z B X
Ala = A 0 1 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
Ser = S 1 0 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 2 1 2 2 2
Gly = G 1 1 0 2 2 1 2 2 1 1 2 2 2 1 2 2 1 2 2 1 2 2 2
Leu = L 2 1 2 0 2 1 2 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 2
Lys = K 2 2 2 2 0 2 1 2 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2
Val = V 1 2 1 1 2 0 2 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2
Thr = T 1 1 2 2 1 2 0 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2
Pro = P 1 1 2 1 2 2 1 0 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2
Glu = E 1 2 1 2 1 1 2 2 0 1 2 2 1 2 2 2 2 2 2 2 1 2 2
Asp = D 1 2 1 2 2 1 2 2 1 0 1 2 2 2 2 1 2 1 2 2 2 1 2
Asn = N 2 1 2 2 1 2 1 2 2 1 0 1 2 2 2 1 2 1 2 2 2 1 2
Ile = I 2 1 2 1 1 1 1 2 2 2 1 0 2 1 1 2 2 2 1 2 2 2 2
Gln = Q 2 2 2 1 1 2 2 1 1 2 2 2 0 1 2 2 2 1 2 2 1 2 2
Arg = R 2 1 1 1 1 2 1 1 2 2 2 1 1 0 2 2 1 1 1 1 2 2 2
Phe = F 2 1 2 1 2 1 2 2 2 2 2 1 2 2 0 1 1 2 2 2 2 2 2
Tyr = Y 2 1 2 2 2 2 2 2 2 1 1 2 2 2 1 0 1 1 3 2 2 1 2
Cys = C 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 0 2 2 1 2 2 2
His = H 2 2 2 1 2 2 2 1 2 1 1 2 1 1 2 1 2 0 2 2 2 1 2
Met = M 2 2 2 1 1 1 1 2 2 2 2 1 2 1 2 3 2 2 0 2 2 2 2
Trp = W 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 0 2 2 2
Glx = Z 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2
Asx = B 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 2 2 1 2
??? = X 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
The table is generated by calculating the minimum number of base changes required to
convert an amino acid in row i to an amino acid in column j.
Note Met->Tyr is the only change that requires all 3 codon positions to change.
Genetic Code Matrix
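The rule used to build the table (minimum number of base changes between any codon of one amino acid and any codon of the other) can be sketched as follows. The sketch assumes the standard genetic code written in DNA letters; the names `CODON_TABLE`, `codons_for` and `min_base_changes` are illustrative, and the code can be used to spot-check individual entries rather than to reproduce the slide's table exactly.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(aa):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]


def min_base_changes(aa1, aa2):
    """Minimum number of nucleotide changes between any pair of codons."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


print(min_base_changes("M", "Y"))  # Met (ATG) vs Tyr (TAT, TAC)
print(min_base_changes("A", "S"))  # Ala (GCN) vs Ser (TCN, AGT, AGC)
```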
35. This genetic code matrix already
improves sensitivity and specificity
of alignments relative to the identity
matrix.
The fact that the genetic code matrix
works to align related proteins, in
the same way that matrices derived
from amino-acid properties work
says something very interesting
about the genetic code: namely that
it appears to have evolved to
minimize the effects of point
mutations.
Genetic Code Matrix
37. • Simple identity, which scores only identical amino
acids as a match.
• Genetic code changes, which scores the
minimum number of nucleotide changes to change
a codon for one amino acid into a codon for the
other.
• Chemical similarity of amino acid side
chains, which scores as a match two amino acids
which have a similar side chain, such as
hydrophobic, charged and polar amino acid groups.
Overview
38. All proteins are polymers of the 20 naturally occurring
amino acids. They are listed here along with their
abbreviations:
Alanine Ala A
Cysteine Cys C
Aspartic AciD Asp D
Glutamic Acid Glu E
Phenylalanine Phe F
Glycine Gly G
Histidine His H
Isoleucine Ile I
Lysine Lys K
Leucine Leu L
Methionine Met M
AsparagiNe Asn N
Proline Pro P
Glutamine Gln Q
ARginine Arg R
Serine Ser S
Threonine Thr T
Valine Val V
Tryptophan Trp W
TYrosine Tyr Y
Amino Acid Residues
39. All amino acids have the
same general formula
Amino Acid Residues
40. • Hydrophobic-aliphatic amino
acids: Their side chains consist of
non-polar methyl- or methylene-
groups.
– These amino acids are usually located
on the interior of the protein as they
are hydrophobic in nature.
– All except for alanine are bifurcated. In
the cases of Val and Ile the bifurcation
is close to the main chain and can
therefore restrict the conformation of
the polypeptide by steric hindrance.
– red and blue atoms represent polar
main chain groups
Amino Acid Residues
42. • Hydrophobic-aromatic: Only
phenylalanine is entirely non-polar.
Tyrosine's phenolic side chain has a
hydroxyl substituent and tryptophan
has a nitrogen atom in its indole ring
system.
– These residues are nearly always found
to be largely buried in the hydrophobic
interior of a protein as they are
predominantly non-polar in nature.
– However, the polar atoms of tyrosine
and tryptophan allow hydrogen bonding
interactions to be made with other
residues or even solvent molecules.
Amino Acid Residues
44. Neutral-polar side chains: a number of
small aliphatic side chains containing polar
groups which cannot ionize readily.
– Serine and threonine possess hydroxyl groups in
their side chains and as these polar groups are
close to the main chain they can form hydrogen
bonds with it. This can influence the local
conformation of the polypeptide.
– Residues such as serine and asparagine are
known to adopt conformations which most other
amino acids cannot.
– The amino acids asparagine and glutamine
possess amide groups in their side chains which
are usually hydrogen-bonded whenever they
occur in the interior of a protein.
Amino Acid Residues
46. • Acidic amino acids: Aspartate and
glutamate have carboxyl side chains
and are therefore negatively charged
at physiological pH (around neutral).
– The strongly polar nature of these
residues means that they are most often
found on the surface of globular proteins
where they can interact favourably with
solvent molecules.
– These residues can also take part in
electrostatic interactions with positively
charged basic amino acids.
– Aspartate and glutamate also can take
on catalytic roles in the active sites of
enzymes and are well known for their
metal ion binding abilities
Amino Acid Residues
48. • Basic amino acids:
– Histidine has the lowest pKa (around 6) and is
therefore mostly neutral at around physiological pH.
• This amino acid occurs very frequently in enzyme
active sites as it can function as a very efficient
general acid-base catalyst.
• It also acts as a metal ion ligand in numerous
protein families.
– Lysine and arginine are more strongly basic and
are positively charged at physiological pH. They
are generally solvated but do occasionally occur
in the interior of a protein where they are usually
involved in electrostatic interactions with
negatively charged groups such as Asp or Glu.
• Lys and Arg have important roles in anion-binding
proteins as they can interact electrostatically with
the ligand.
Amino Acid Residues
50. Conformationally important residues: Glycine and
proline are unique amino acids. They appear to
influence the conformation of the polypeptide.
• Glycine essentially lacks a side chain and therefore
can adopt conformations which are sterically
forbidden for other amino acids. This confers a high
degree of local flexibility on the polypeptide.
– Accordingly, glycine residues are frequently found in
turn regions of proteins where the backbone has to
make a sharp turn.
– Glycine occurs abundantly in certain fibrous proteins
due to its flexibility and because its small size allows
adjacent polypeptide chains to pack together closely.
• In contrast, proline is the most rigid of the twenty
naturally occurring amino acids since its side chain
is covalently linked with the main chain nitrogen
Amino Acid Residues
52. Here is one list where amino acids are
grouped according to the characteristics of
the side chains:
Aliphatic - alanine, glycine, isoleucine,
leucine, proline, valine,
Aromatic - phenylalanine, tryptophan,
tyrosine,
Acidic - aspartic acid, glutamic acid,
Basic - arginine, histidine, lysine,
Hydroxylic - serine, threonine
Sulphur-containing - cysteine,
methionine
Amidic (containing amide group) -
asparagine, glutamine
Amino Acid Residues
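A grouping like the one above translates directly into a simple chemical-similarity scoring function. The sketch below copies the seven groups from the list; the scores 1 and 0 and the names `GROUPS`, `GROUP_OF` and `similarity_score` are illustrative assumptions.

```python
# Side-chain groups copied from the list above (one-letter codes).
GROUPS = {
    "aliphatic": "AGILPV",
    "aromatic": "FWY",
    "acidic": "DE",
    "basic": "RHK",
    "hydroxylic": "ST",
    "sulphur-containing": "CM",
    "amidic": "NQ",
}
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def similarity_score(a, b, same_group=1, different_group=0):
    """Score a pair as a match when both residues share a side-chain group."""
    return same_group if GROUP_OF[a] == GROUP_OF[b] else different_group


print(similarity_score("D", "E"))  # both acidic
print(similarity_score("D", "K"))  # acidic vs basic
```

A different grouping (for example by polarity or size) would give a different matrix, which is one reason such hand-made schemes are hard to justify.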
54. Other similarity scoring matrices might be constructed from
any property of amino acids that can be quantified
- partition coefficients between hydrophobic and hydrophilic phases
- charge
- molecular volume
Unfortunately, …
55. AAindex
Amino acid indices and similarity matrices
(http://www.genome.ad.jp/dbget/aaindex.html)
List of 494 Amino Acid Indices in AAindex ver.6.0
• ANDN920101 alpha-CH chemical shifts (Andersen et al., 1992)
• ARGP820101 Hydrophobicity index (Argos et al., 1982)
• ARGP820102 Signal sequence helical potential (Argos et al., 1982)
• ARGP820103 Membrane-buried preference parameters (Argos et al., 1982)
• BEGF750101 Conformational parameter of inner helix (Beghin-Dirkx, 1975)
• BEGF750102 Conformational parameter of beta-structure (Beghin-Dirkx, 1975)
• BEGF750103 Conformational parameter of beta-turn (Beghin-Dirkx, 1975)
• BHAR880101 Average flexibility indices (Bhaskaran-Ponnuswamy, 1988)
• BIGC670101 Residue volume (Bigelow, 1967)
• BIOV880101 Information value for accessibility; average fraction 35% (Biou et al., 1988)
• BIOV880102 Information value for accessibility; average fraction 23% (Biou et al., 1988)
• BROC820101 Retention coefficient in TFA (Browne et al., 1982)
• BROC820102 Retention coefficient in HFBA (Browne et al., 1982)
• BULH740101 Transfer free energy to surface (Bull-Breese, 1974)
• BULH740102 Apparent partial specific volume (Bull-Breese, 1974)
57. • Simple identity, which scores only identical amino
acids as a match.
• Genetic code changes, which scores the
minimum number of nucleotide changes to change
a codon for one amino acid into a codon for the
other.
• Chemical similarity of amino acid side
chains, which scores as a match two amino acids
which have a similar side chain, such as
hydrophobic, charged and polar amino acid groups.
• The Dayhoff percent accepted mutation (PAM)
family of matrices, which scores amino acid pairs
on the basis of the expected frequency of
substitution of one amino acid for the other during
protein evolution.
Overview
58. • In the absence of a valid model
derived from first principles, an
empirical approach
seems more appropriate to score
amino acid similarity.
• This approach is based on
the assumption that once the
evolutionary relationship of two
sequences is
established, the residues that did
exchange are similar.
Dayhoff Matrix
59. Model of Evolution:
“Proteins evolve through a succession of
independent point mutations, that are
accepted in a population and
subsequently can be observed in the
sequence pool.”
Definition:
The evolutionary distance between two
sequences is the (minimal) number of
point mutations that was necessary to
evolve one sequence into the other
Overview
60. • The model used here states that
proteins evolve through a succession of
independent point mutations, that are
accepted in a population and
subsequently can be observed in the
sequence pool.
• We can define an evolutionary
distance between two sequences as
the number of point mutations that was
necessary to evolve one sequence into
the other.
Principle
61. • M.O. Dayhoff and colleagues
introduced the term "accepted point
mutation" for a mutation that is stably
fixed in the gene pool in the course
of evolution. Thus a measure of
evolutionary distance between two
sequences can be defined:
• A PAM (Percent accepted mutation)
is one accepted point mutation on
the path between two
sequences, per 100 residues.
Overview
62. First step: finding “accepted mutations”
In order to identify accepted point
mutations, a complete phylogenetic
tree including all ancestral sequences
has to be constructed. To avoid a
large degree of ambiguities in this
step, Dayhoff and colleagues
restricted their analysis to sequence
families with more than 85% identity.
Principles of Scoring Matrix Construction
63. Identification of accepted point mutations:
•Collection of correct (manual) alignments
• 1300 sequences in 72 families
• closely related in order not to get multiple
changes at the same position
• Construct a complete phylogenetic tree including all
ancestral sequences.
• Dayhoff et al restricted their analysis to
sequence families with more than 85%
identity.
• Tabulate into a 20x20 matrix the amino acid pair
exchanges for each of the observed and inferred
sequences.
Overview
64. Observed sequences: ACGH, DBGH, ADIJ, CBIJ. Inferred ancestral sequences: ABGH, ABIJ.
ACGH      DBGH        ADIJ      CBIJ
   \      /              \      /
B-C \    / A-D        B-D \    / A-C
     \  /                  \  /
     ABGH                  ABIJ
        \                  /
     I-G \                /
     J-H  \              /
           \            /
            \__________/
                 |
                 |
Overview
65. Dayhoff’s PAM1 mutation probability matrix (Transition Matrix)
Ala Arg Asn Asp Cys Gln Glu Gly His Ile
A R N D C Q E G H I
A 9867 2 9 10 3 8 17 21 2 6
R 1 9913 1 0 1 10 0 0 10 3
N 4 1 9822 36 0 4 6 6 21 3
D 6 0 42 9859 0 6 53 6 4 1
C 1 1 0 0 9973 0 0 0 1 1
Q 3 9 4 5 0 9876 27 1 23 1
E 10 0 7 56 0 35 9865 4 2 3
G 21 1 12 11 1 3 7 9935 1 0
H 1 8 18 3 1 20 1 0 9912 0
I 2 2 3 1 2 1 2 0 0 9872
66. PAM1: Transition Matrix
Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
A R N D C Q E G H I L K M F P S T W Y V
Ala A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18
Arg R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1
Asn N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1
Asp D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1
Cys C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2
Gln Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1
Glu E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2
Gly G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5
His H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1
Ile I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33
Leu L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15
Lys K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1
Met M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4
Phe F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0
Pro P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2
Ser S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2
Thr T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9
Trp W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0
Tyr Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1
Val V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901
67. Numbers of accepted point mutations (x10)
accumulated from closely related
sequences.
Fractional exchanges result when ancestral
sequences are ambiguous: the
probabilities are distributed equally
among all possibilities.
The total number of exchanges tallied was
1,572. Note that 36 exchanges were
never observed.
The Asp-Glu pair had the largest number of
exchanges
PAM1: Transition Matrix
68. Second step: Frequencies of Occurrence
If the properties of amino acids differ and if
they occur with different frequencies, all
statements we can make about the average
properties of sequences will depend on the
frequencies of occurrence of the individual
amino acids. These frequencies of
occurrence are approximated by the
frequencies of observation. They are the
number of occurrences of a given amino acid
divided by the number of amino acids
observed.
The sum of all frequencies is one.
Principles of Scoring Matrix Construction
69. Amino acid frequencies
1978 1991
L 0.085 0.091
A 0.087 0.077
G 0.089 0.074
S 0.070 0.069
V 0.065 0.066
E 0.050 0.062
T 0.058 0.059
K 0.081 0.059
I 0.037 0.053
D 0.047 0.052
R 0.041 0.051
P 0.051 0.051
N 0.040 0.043
Q 0.038 0.041
F 0.040 0.040
Y 0.030 0.032
M 0.015 0.024
H 0.034 0.023
C 0.033 0.020
W 0.010 0.014
Second step: Frequencies of Occurrence
70. Third step: Relative Mutabilities
• To obtain a complete picture of the
mutational process, the amino-acids that
do not mutate must be taken into account
too.
• We need to know: what is the chance, on
average, that a given amino acid will
mutate at all. This is the relative
mutability of the amino acid.
• It is obtained by dividing the number
of observed changes by the amino acid's
frequency of occurrence.
Principles of Scoring Matrix Construction
71. Compute amino acid mutability, mj, i.e., the probability
of a given amino acid, j, to be replaced.
Aligned sequences: A D A
                   A D B
Amino acid              A    B    D
Observed changes        1    1    0
Frequency of occurrence 3    1    2
(total composition)
Relative mutability     .33  1    0
Overview
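The small example on the slide above (aligned sequences ADA and ADB) can be reproduced with a short sketch. It assumes a single pairwise alignment without gaps, and the function name `relative_mutability` is illustrative; Dayhoff's actual tallies were accumulated over many families and inferred ancestors.

```python
from collections import Counter


def relative_mutability(seq1, seq2):
    """Observed changes per occurrence for each amino acid in two aligned sequences."""
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    for a, b in zip(seq1, seq2):
        if a != b:  # a substitution counts as a change for both residues
            changes[a] += 1
            changes[b] += 1
    return {aa: changes[aa] / n for aa, n in occurrences.items()}


print(relative_mutability("ADA", "ADB"))
```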
72. Relative mutabilities (scaled so that Ala = 100)
1978 1991
A 100 100
C 20 44
D 106 86
E 102 77
F 41 51
G 49 50
H 66 91
I 96 103
K 56 72
L 40 54
M 94 93
N 134 104
P 56 58
Q 93 84
R 65 83
S 120 117
T 97 107
V 74 98
W 18 25
Y 41 50
Principles of Scoring Matrix Construction
73. Fourth step: Mutation Probability Matrix
• With these data the probability that an amino acid in
row i of the matrix will replace the amino acid in
column j can be calculated: it is the mutability of amino
acid j, multiplied by the relative pair exchange
frequency (the pair exchange frequency for ij divided
by the sum of all pair exchange frequencies for amino
acid j).
• With $m_j$ the mutability of amino acid j and $A_{ij}$ the pair exchange
frequency, this reads $M_{ij} = m_j A_{ij} / \sum_k A_{kj}$ for $i \neq j$
(up to a constant scale factor).
• The mutation probability matrix $M_{ij}$ gives the
probability that an amino acid i will replace an amino
acid of type j in a given evolutionary interval, in two
related sequences.
Principles of Scoring Matrix Construction
[Figure: example alignment ADB / ADA and a 3x3 matrix with rows i and columns j labelled A, D, B]
74. Fifth step: The Evolutionary Distance
• Since the diagonal elements $M_{jj}$ represent the probabilities
for amino acids to remain
conserved, if we scale all off-diagonal cells of our
matrix by a constant factor (keeping each column sum at one) we can
scale the matrix to reflect a specific
overall probability of change. We
may choose this factor so that the expected
number of changes is 1%; this
gives the matrix for the evolutionary
distance of 1 PAM.
Principles of Scoring Matrix Construction
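The fourth and fifth steps can be put together in a short sketch. It assumes a symmetric table of pair exchange counts and a vector of frequencies of occurrence; the three-letter alphabet, the counts and the function name `mutation_probability_matrix` are made up for illustration and are not Dayhoff's data.

```python
def mutation_probability_matrix(counts, freqs, target_change=0.01):
    """Build M[i][j] = P(amino acid i replaces amino acid j).

    counts[i][j]: symmetric pair exchange counts A_ij (diagonal ignored).
    freqs[j]:     frequency of occurrence f_j.
    The scale factor is chosen so that sum_j f_j * (1 - M[j][j]) equals
    target_change (0.01 corresponds to 1 PAM).
    """
    n = len(freqs)
    col_sums = [sum(counts[i][j] for i in range(n) if i != j) for j in range(n)]
    # Relative mutability: observed changes divided by frequency of occurrence.
    mutability = [col_sums[j] / freqs[j] for j in range(n)]
    scale = target_change / sum(f * m for f, m in zip(freqs, mutability))
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j:
                matrix[i][j] = scale * mutability[j] * counts[i][j] / col_sums[j]
        matrix[j][j] = 1.0 - scale * mutability[j]  # probability of no change
    return matrix


# Hypothetical three-letter alphabet with made-up counts and frequencies.
counts = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]
freqs = [0.5, 0.3, 0.2]
pam1 = mutation_probability_matrix(counts, freqs)
for row in pam1:
    print([round(x, 5) for x in row])
```

Each column of the result sums to one, and the frequency-weighted probability of change is 1%, which is the defining property of a 1 PAM matrix.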
75. Sixth step: Relatedness Odds
• By comparison, the probability that
the same event is observed by
random chance is simply given by
the frequency of occurrence of
amino acid i.
• $M_{ij}$ = probability that i replaces j in
related proteins
• $P_i^{ran}$ = probability that i replaces j by
chance (e.g. in unrelated proteins)
• $P_i^{ran} = f_i$ = the frequency of
occurrence of amino acid i
• Relatedness odds: $R_{ij} = M_{ij} / f_i$
Principles of Scoring Matrix Construction
76. Last step: the log-odds matrix
• Since multiplication is a computationally
expensive process, it is preferable to add
the logarithms of the matrix elements. This
matrix, the log odds matrix, is the
foundation of quantitative sequence
comparisons under an evolutionary model.
• Since the Dayhoff matrix was taken as the
log to base 10, a value of +1 would mean
that the corresponding pair has been
observed 10 times more frequently than
expected by chance. A value of -0.2 would
mean that the observed pair was observed
1.6 times less frequently than chance
would predict.
Principles of Scoring Matrix Construction
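The conversion between odds and log-odds can be illustrated with a few lines. The sketch assumes the relatedness odds are the mutation probability divided by the frequency of occurrence, and it uses two values quoted elsewhere in this document (13% for Ala remaining Ala at 250 PAMs, and the 1978 frequency 0.087 for Ala); the function name `log_odds` is illustrative.

```python
import math


def log_odds(m_ij, f_i):
    """log10 of the relatedness odds R_ij = M_ij / f_i."""
    return math.log10(m_ij / f_i)


# Converting log-odds values back to odds ratios, as in the text above.
print(10 ** 1.0)       # 10.0: seen ten times more often than by chance
print(1 / 10 ** -0.2)  # about 1.58: seen roughly 1.6 times less often

# Ala-Ala at 250 PAMs: M = 0.13 and f_Ala = 0.087.
ala_ala = log_odds(0.13, 0.087)
print(round(ala_ala, 2))
```

Because the scores are logarithms, the score of an alignment is the sum of its per-position scores, which corresponds to multiplying the odds.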
78. A B C D E F G H I K L M N P Q R S T V W Y Z
0.4 0.0 -0.4 0.0 0.0 -0.8 0.2 -0.2 -0.2 -0.2 -0.4 -0.2 0.0 0.2 0.0 -0.4 0.2 0.2 0.0 -1.2 -0.6 0.0 A
0.5 -0.9 0.6 0.4 -1.0 0.1 0.3 -0.4 0.1 -0.7 -0.5 0.4 -0.2 0.3 -0.1 0.1 0.0 -0.4 -1.1 -0.6 0.4 B
2.4 -1.0 -1.0 -0.8 -0.6 -0.6 -0.4 -1.0 -1.2 -1.0 -0.8 -0.6 -1.0 -0.8 0.0 -0.4 -0.4 -1.6 0.0 -1.0 C
0.8 0.6 -1.2 0.2 0.2 -0.4 0.0 -0.8 -0.6 0.4 -0.2 0.4 -0.2 0.0 0.0 -0.4 -1.4 -0.8 0.5 D
0.8 -1.0 0.0 0.2 -0.4 0.0 -0.6 -0.4 0.2 -0.2 0.4 -0.2 0.0 0.0 -0.4 -1.4 -0.8 0.6 E
1.8 -1.0 -0.4 0.2 -1.0 0.4 0.0 -0.8 -1.0 -1.0 -0.8 -0.6 -0.6 -0.2 0.0 1.4 -1.0 F
1.0 -0.4 -0.6 -0.4 -0.8 -0.6 0.0 -0.2 -0.2 -0.6 0.2 0.0 -0.2 -1.4 -1.0 -0.1 G
1.2 -0.4 0.0 -0.4 -0.4 0.4 0.0 0.6 0.4 -0.2 -0.2 -0.4 -0.6 0.0 -0.4 H
1.0 -0.4 0.4 0.4 -0.4 -0.4 -0.4 -0.4 -0.2 0.0 0.8 -1.0 -0.2 -0.4 I
1.0 -0.6 0.0 0.2 -0.2 0.2 0.6 0.0 0.0 -0.4 -0.6 -0.8 0.1 K
1.2 0.8 -0.6 -0.6 -0.4 -0.6 -0.6 -0.4 0.4 -0.4 -0.2 -0.5 L
1.2 -0.4 -0.4 -0.2 0.0 -0.4 -0.2 0.4 -0.8 -0.4 -0.3 M
0.4 -0.2 0.2 0.0 0.2 0.0 -0.4 -0.8 -0.4 0.2 N
1.2 0.0 0.0 0.2 0.0 -0.2 -1.2 -1.0 -0.1 P
0.8 0.2 -0.2 -0.2 -0.4 -1.0 -0.8 0.6 Q
1.2 0.0 -0.2 -0.4 0.4 -0.8 0.6 R
0.4 0.2 -0.2 -0.4 -0.6 -0.1 S
0.6 0.0 -1.0 -0.6 -0.1 T
0.8 -1.2 -0.4 -0.4 V
3.4 0.0 -1.2 W
2.0 -0.8 Y
0.6 Z
PAM 1 Scoring Matrix
79. • Some of the properties that go into the
makeup of PAM matrices are: amino
acid residue size, shape, local
concentrations of electric charge, van
der Waals surface, ability to form salt
bridges, hydrophobic interactions, and
hydrogen bonds.
– These patterns are imposed principally
by natural selection and only secondarily
by the constraints of the genetic code.
– Coming up with one’s own matrix of
weights based on some logical features
may not be very successful because your
logical features may have been
overwritten by other more important
considerations.
Overview
80. • Two aspects of this process cause the
evolutionary distance to be unequal in
general to the number of observed
differences between the sequences:
– First, there is a chance that a certain
residue may have mutated, then reverted,
hiding the effect of the mutation.
– Second, specific residues may have
mutated more than once, thus the number
of point mutations is likely to be larger
than the number of differences between
the two sequences.
Principles of Scoring Matrix Construction
82. • Initialize:
– Generate Random protein (1000 aa)
• Simulate evolution (eg 250 for PAM250)
– Apply PAM1 Transition matrix to each amino
acid
– Use Weighted Random Selection
• Iterate
– Measure difference to original protein
Experiment: pam-simulator.pl
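The original experiment refers to a Perl script whose code is not shown. The following is an independent Python sketch of the same steps, not a port of that script: it starts from a random 1000-residue protein, applies the PAM1 matrix tabulated earlier in this document once per PAM unit using weighted random selection, and measures the identity to the starting sequence. The names `evolve` and `identity`, the fixed seed and the uniform starting composition are assumptions.

```python
import random

AA = "ARNDCQEGHILKMFPSTWYV"
# PAM1 mutation probability matrix (x 10,000) as tabulated earlier in this
# document: columns = original amino acid, rows = replacement amino acid.
PAM1_ROWS = """
9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18
1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1
4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1
6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1
1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2
3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1
10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2
21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5
1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1
2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33
3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15
2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1
1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4
1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0
13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2
28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2
22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9
0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0
1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1
13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901
"""
PAM1 = [[int(x) for x in line.split()] for line in PAM1_ROWS.strip().splitlines()]
# For each original residue, the weights of all possible replacements (one column).
WEIGHTS = {AA[j]: [PAM1[i][j] for i in range(20)] for j in range(20)}


def evolve(seq, steps, rng):
    """Apply the PAM1 matrix `steps` times using weighted random selection."""
    seq = list(seq)
    for _ in range(steps):
        seq = [rng.choices(AA, weights=WEIGHTS[aa])[0] for aa in seq]
    return "".join(seq)


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)                                # fixed seed for repeatability
start = "".join(rng.choice(AA) for _ in range(1000))  # random 1000-residue protein
for pam in (1, 50, 250):
    print(pam, identity(start, evolve(start, pam, rng)))
```

The printed identities depend on the seed; the point of the experiment is that the observed difference grows more slowly than the PAM distance.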
87. PAM Value Distance(%)
80 50
100 60
200 75
250 85 <- Twilight zone
300 92
(From Doolittle, 1987, Of URFs and ORFs,
University Science Books)
Some PAM values and their corresponding observed distances
• When the PAM distance between two distantly related proteins nears 250, it
becomes difficult to tell whether the two proteins are homologous or whether they are two
proteins taken at random that can be aligned by chance. In that case we speak of the 'twilight
zone'.
• The table relates the observed percentage difference between two sequences to the PAM
value. Two randomly diverging sequences change in a negatively exponential fashion. After
the insertion of gaps into two random sequences, it can be expected that they will be 80-90%
dissimilar (from Doolittle, 1987).
88. • Creation of a pam series from evolutionary
simulations
• pam2=pam1^2
• pam3=pam1^3
• And so on…
• pam30,60,90,120,250,300
• low pam - closely related sequences
– high scores for identity and low scores for
substitutions - closer to the identity matrix
• high pam - distant sequences
– at pam2000 all information is degenerate except
for cysteines
• pam250 is the most popular and general
– one amino acid in five remains unchanged
(mutability varies among the amino acids)
Overview
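The relation pamN = pam1^N is a matrix power, and a toy case shows why N PAMs do not translate into N% observed differences. The sketch below assumes a hypothetical two-letter alphabet with a 1% chance of change per PAM; `matmul` and `matrix_power` are plain-Python helpers written for this illustration.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(m, n):
    """Multiply m by itself n times (n >= 1), i.e. PAMn = PAM1^n."""
    result = m
    for _ in range(n - 1):
        result = matmul(result, m)
    return result


# Hypothetical two-letter alphabet with a 1% chance of change per PAM.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam2 = matrix_power(toy_pam1, 2)
toy_pam250 = matrix_power(toy_pam1, 250)
print(1 - toy_pam2[0][0])    # slightly below 0.02: some sites changed back
print(1 - toy_pam250[0][0])  # stays below 0.5 despite 250 changes per 100 sites
```

With 20 amino acids the limiting values differ, but the same saturation effect explains why roughly one residue in five is still unchanged at 250 PAMs.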
90. 250 PAM evolutionary distance
A R N D C Q E G H I L K M F P
Ala A 13 6 9 9 5 8 9 12 6 8 6 7 7 4 11
Arg R 3 17 4 3 2 5 3 2 6 3 2 9 4 1 4
Asn N 4 4 6 7 2 5 6 4 6 3 2 5 3 2 4
Asp D 5 4 8 11 1 7 10 5 6 3 2 5 3 1 4
Cys C 2 1 1 1 52 1 1 2 2 2 1 1 1 1 2
Gln Q 3 5 5 6 1 10 7 3 7 2 3 5 3 1 4
Glu E 5 4 7 11 1 9 12 5 6 3 2 5 3 1 4
Gly G 12 5 10 10 4 7 9 27 5 5 4 6 5 3 8
His H 2 5 5 4 2 7 4 2 15 2 2 3 2 2 3
Ile I 3 2 2 2 2 2 2 2 2 10 6 2 6 5 2
Leu L 6 4 4 3 2 6 4 3 5 15 34 4 20 13 5
Lys K 6 18 10 8 2 10 8 5 8 5 4 24 9 2 6
Met M 1 1 1 1 0 1 1 1 1 2 3 2 6 2 1
Phe F 2 1 2 1 1 1 1 1 3 5 6 1 4 32 1
Pro P 7 5 5 4 3 5 4 5 5 3 3 4 3 2 20
Ser S 9 6 8 7 7 6 7 9 6 5 4 7 5 3 9
Thr T 8 5 6 6 4 5 5 6 4 6 4 6 5 3 6
Trp W 0 2 0 0 0 0 0 0 1 0 1 0 0 1 0
Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1 2 15 1
Val V 7 4 4 4 4 4 4 4 5 4 15 10 4 10 5
[column on left represents the replacement amino acid]
Mutation probability matrix for the evolutionary distance of 250 PAMs. To
simplify the appearance, the elements are shown multiplied by 100.
In comparing two sequences of average amino acid frequency at this
evolutionary distance, there is a 13% probability that a position
containing Ala in the first sequence will contain Ala in the second.
There is a 3% chance that it will contain Arg, and so forth.
Overview
91. [Figure: A brief history of time, on an axis from 4 to 0 billion years ago (BYA), marking the origin of life, the earliest fossils, the origin of eukaryotes, the plant/animal and fungi/animal divergences, and insects]
92. Margaret Dayhoff’s 34 protein superfamilies
Protein PAMs per 100 million years
Ig kappa chain 37
Kappa casein 33
Lactalbumin 27
Hemoglobin 12
Myoglobin 8.9
Insulin 4.4
Histone H4 0.10
Ubiquitin 0.00
93. Many sequences depart from average
composition.
Rare replacements were observed too
infrequently to resolve relative
probabilities accurately (for 36 pairs no
replacements were observed!).
Errors in 1PAM are magnified in the
extrapolation to 250PAM.
Distantly related sequences usually
have islands (blocks) of conserved
residues. This implies that replacement
is not equally probable over the entire
sequence.
Sources of error
94. • Simple identity, which scores only identical amino
acids as a match.
• Genetic code changes, which scores the
minimum number of nucleotide changes to change
a codon for one amino acid into a codon for the
other.
• Chemical similarity of amino acid side
chains, which scores as a match two amino acids
which have a similar side chain, such as
hydrophobic, charged and polar amino acid groups.
• The Dayhoff percent accepted mutation (PAM)
family of matrices, which scores amino acid pairs
on the basis of the expected frequency of
substitution of one amino acid for the other during
protein evolution.
• The blocks substitution matrix (BLOSUM) amino
acid substitution tables, which scores amino acid
pairs based on the frequency of amino acid
substitutions in aligned sequence motifs called
blocks which are found in protein families
Overview
95. • Henikoff & Henikoff (Henikoff, S. & Henikoff, J.G. (1992) PNAS 89:10915-10919)
• They asked how to score the relatedness of distantly
related amino acid sequences.
• They use blocks of sequence fragments
from different protein families which can
be aligned without the introduction of
gaps. These sequence blocks correspond
to the more highly conserved regions.
BLOSUM: Blocks Substitution Matrix
96. BLOSUM (BLOck – SUM) scoring
Block = ungapped alignment
  a b c d e f
1 D D N A A V
2 D N A V D D
3 N N V A V V
E.g. amino acids D, N, V, A
S = 3 sequences
W = 6 aa
N = (W*S*(S-1))/2 = 18 pairs
97. A. Observed pairs
  a b c d e f
1 D D N A A V
2 D N A V D D
3 N N V A V V
Observed pair counts fij (pairs counted within each column):
    D     N     A     V
D   1     4     1     3
N         1     1     1
A               1     4
V                     1
Relative frequency table gij = fij / 18:
    D     N     A     V
D   .056  .222  .056  .167
N         .056  .056  .056
A               .056  .222
V                     .056
gij is the probability of obtaining a given pair
if randomly choosing pairs
from the block.
98. B. Expected pairs
DDNAAV
DNAVDD
NNVAVV
Residue composition of the block:
DDDDD  PD = 5/18
NNNN   PN = 4/18
AAAA   PA = 4/18
VVVVV  PV = 5/18
P{Draw DN pair} = P{Draw D, then N or Draw N, then D}
P{Draw DN pair} = PDPN + PNPD = 2 * (5/18)*(4/18) = .123
Random relative frequency table eij:
    D     N     A     V
D   .077  .123  .123  .154
N         .049  .099  .123
A               .049  .123
V                     .077
eij is the probability of obtaining a pair of
each amino acid drawn
independently from the block.
99. C. Summary (A/B)
$s_{ij} = \log_2 (g_{ij} / e_{ij})$
($s_{ij}$) is the basic BLOSUM score matrix
Notes:
• Observed pairs in blocks contain information about
relationships at all levels of evolutionary distance
simultaneously (cf. Dayhoff's close relationships)
• The actual algorithm generates observed + expected pair
distributions by accumulation over a set of approx. 2000
ungapped blocks of varying width (w) + depth (s)
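Steps A to C can be reproduced for the three-sequence example block with a short sketch. It follows the slide's simplified procedure (one block, expected frequencies from the block's residue composition); the variable names are illustrative, and the clustering of similar sequences and the rounding to integer scores used for the published matrices are left out.

```python
import math
from collections import Counter
from itertools import combinations

block = ["DDNAAV", "DNAVDD", "NNVAVV"]

# A. Observed pairs: count every unordered residue pair within each column.
pair_counts = Counter()
for column in zip(*block):
    for a, b in combinations(column, 2):
        pair_counts[tuple(sorted((a, b)))] += 1
total_pairs = sum(pair_counts.values())  # W * S * (S - 1) / 2
g = {pair: n / total_pairs for pair, n in pair_counts.items()}

# B. Expected pairs: residues drawn independently with the block's composition.
residue_counts = Counter("".join(block))
total_residues = sum(residue_counts.values())
p = {aa: n / total_residues for aa, n in residue_counts.items()}


def expected(a, b):
    """Probability of drawing the unordered pair (a, b) independently."""
    return p[a] * p[b] if a == b else 2 * p[a] * p[b]


# C. Log-odds score for every pair that was actually observed.
s = {pair: math.log2(g[pair] / expected(*pair)) for pair in g}
for pair in sorted(s):
    print(pair, round(s[pair], 2))
```

Pairs that are never observed in the block have no finite score in this sketch, which is one reason real matrices are accumulated over a large set of blocks.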
100. • blosum30,35,40,45,50,55,60,62,65,70,75,80,85,90
• transition frequencies observed directly by identifying
blocks that are at least
– 45% identical (BLOSUM-45)
– 50% identical (BLOSUM-50)
– 62% identical (BLOSUM-62) etc.
• No extrapolation made
• High blosum - closely related sequences
• Low blosum - distant sequences
• blosum45 is roughly comparable to pam250
• blosum62 is roughly comparable to pam160
• blosum62 is the most popular matrix
The BLOSUM Series
102. • Church of the Flying Spaghetti Monster
• http://www.venganza.org/about/open-letter
103. • Which matrix should I use?
– Matrices derived from observed substitution data
(e.g. the Dayhoff or BLOSUM matrices) are
superior to identity, genetic code or physical
property matrices.
– Schwartz and Dayhoff recommended a mutation
data matrix for the distance of 250 PAMs as a
result of a study using a dynamic programming
procedure to compare a variety of proteins known
to be distantly related.
• The 250 PAM matrix was selected since in Monte
Carlo studies matrices reflecting this evolutionary
distance gave a consistently higher significance
score than other matrices in the range 0-750 PAM.
The matrix also gave better scores when compared
to the genetic code matrix and identity scoring.
Overview
104. • When comparing sequences that were not
known in advance to be related, for
example when database scanning:
– default scoring matrix used is the
BLOSUM62 matrix
– if one is restricted to using
only PAM scoring matrices, then
the PAM120 is recommended for
general protein similarity searches
• When using a local alignment
method, Altschul suggests that three
matrices should ideally be used:
PAM40, PAM120 and PAM250. The lower
PAM matrices will tend to find short
alignments of highly similar
sequences, while higher PAM matrices will
find longer, weaker local alignments.
Which matrix should I use?
106. – Henikoff and Henikoff have compared the
BLOSUM matrices to PAM by evaluating how
effectively the matrices can detect known members
of a protein family from a database when searching
with the ungapped local alignment program
BLAST. They conclude that overall the BLOSUM
62 matrix is the most effective.
• However, all the substitution matrices investigated
perform better than BLOSUM 62 for a proportion of
the families. This suggests that no single matrix is
the complete answer for all sequence comparisons.
• It is probably best to complement the BLOSUM 62
matrix with comparisons using 250 PAMs, and
Overington structurally derived matrices.
– It seems likely that as more protein three
dimensional structures are determined, substitution
tables derived from structure comparison will give
the most reliable data.
Overview
108. Dotplots
• What is it ?
– Graphical representation using two orthogonal
axes and “dots” for regions of similarity.
– In a bioinformatics context two sequences are
used on the axes and dots are plotted when a
given threshold is met in a given window.
• Dot-plotting is the best way to see all of the
structures in common between two
sequences or to visualize all of the repeated
or inverted repeated structures in one
sequence
109. Dot Plot References
Gibbs, A. J. & McIntyre, G. A. (1970).
The diagram method for comparing sequences. Its use with
amino acid and nucleotide sequences.
Eur. J. Biochem. 16, 1-11.
Staden, R. (1982).
An interactive graphics program for comparing and aligning
nucleic-acid and amino-acid sequences.
Nucl. Acid. Res. 10 (9), 2951-2961.
110. Visual Alignments (Dot Plots)
• Matrix
– Rows: Characters in one sequence
– Columns: Characters in second sequence
• Filling
– Loop through each row; if character in row, col match, fill
in the cell
– Continue until all cells have been examined
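The filling procedure above amounts to a double loop over both sequences. A minimal sketch, with the function names `dot_matrix` and `render` and the two example sequences chosen for illustration:

```python
def dot_matrix(seq_rows, seq_cols):
    """Cell [i][j] is True when seq_rows[i] equals seq_cols[j]."""
    return [[r == c for c in seq_cols] for r in seq_rows]


def render(matrix):
    """Draw filled cells as '*' and empty cells as '.'."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


print(render(dot_matrix("GATTACA", "GATCACA")))
```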
113. Noise in Dot Plots
• Nucleic Acids (DNA, RNA)
– 1 out of 4 bases matches at random
• Stringency
– Window size is considered
– Percentage of bases matching in the window is
set as threshold
114. Reduction of Dot Plot Noise
Self alignment of ACCTGAGCTCACCTGAGTTA
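Stringency filtering can be sketched by placing a dot only where enough characters match inside a window. The window of 7 and the requirement of 7 matches below are assumptions chosen so that the repeated segment in the example sequence stands out; the function name `windowed_dot_matrix` is illustrative.

```python
def windowed_dot_matrix(seq_rows, seq_cols, window, min_matches):
    """Dot at [i][j] when at least min_matches of the `window` aligned
    characters starting at (i, j) are identical."""
    rows = len(seq_rows) - window + 1
    cols = len(seq_cols) - window + 1
    return [
        [
            sum(a == b for a, b in zip(seq_rows[i:i + window],
                                       seq_cols[j:j + window])) >= min_matches
            for j in range(cols)
        ]
        for i in range(rows)
    ]


seq = "ACCTGAGCTCACCTGAGTTA"
plot = windowed_dot_matrix(seq, seq, window=7, min_matches=7)
for row in plot:
    print("".join("*" if cell else "." for cell in row))
```

Lowering `min_matches` relative to `window` lets weaker similarities through at the cost of more background dots.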
117. • Regions of similarity appear
as diagonal runs of dots
• Reverse diagonals
(perpendicular to diagonal)
indicate inversions
• Reverse diagonals crossing
diagonals (Xs) indicate
palindromes
• A gap is introduced by each
vertical or horizontal skip
Overview
118. • Window size changes with goal
of analysis
– size of average exon
– size of average protein structural
element
– size of gene promoter
– size of enzyme active site
Overview
119. Rules of thumb
• Don't get too many points: about 3-5 times the length of the sequence is about right (1-2%)
• Window size about 20 for distant proteins, 12 for nucleic acid
• Check sequence vs. itself
• Check sequence vs. sequence
• Anticipate results (e.g. “in-house” sequence vs. genomic, question)
Overview
120. Available Dot Plot Programs
Dotlet (Java Applet)
http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html
121. Available Dot Plot Programs
Dotter (http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html)
123. Weblems
• W3.1: Why does 2 PAM, i.e. 1 PAM multiplied with itself,
not correspond to exactly 2% of the amino acids having
mutated, but a little less than 2% ? Or, in other words, why
does a 250 PAM matrix not correspond to 250% accepted
mutations ?
• W3.2: Is it biologically plausible that the C-C and W-W
entries in the scoring matrices are the most prominent ?
Which entries (or groups of entries) are the least prominent ?
• W3.3: What is OMIM ? How many entries are there ? What
percentage of OMIM listed diseases has no known (gene)
cause ?
• W3.4: Pick one disease mapped to chromosome Y from
OMIM where only a mapping region is known. How many
candidate genes can you find in the locus using ENSEMBL ?
Can you link ontology terms for the candidates to the disease
phenotype ?