Bioinformatics and
Phylogenetic Analysis
Edgar Scott
Multicampus Bioinformatics
Education Specialist
What is Bioinformatics?
 Interdisciplinary field that applies
principles and techniques from
computer science, probability and
statistics, and linguistics to the study of
genomic and proteomic sequences.
 Biological databases for storing and
organizing DNA and protein sequences
 Computational tools for analyzing
sequences
Phylogenetic Analysis and
Bioinformatics
 Phylogenetics – study of evolutionary
relationships
 Phylogenetic trees used to represent
evolutionary relationships
 Use of protein or DNA sequences, rather than
morphological characters, to detect relationships
 Bioinformatics provides both sequence
repositories and sequence analysis software.
Overview
 Acquiring Data Set
 Text searching at the National Center for
Biotechnology Information (NCBI)
 Sequence similarity and homology
 Sequence similarity searching with Basic Local
Alignment Search Tool (BLAST)
 Analyzing Data Set
 Phylogenetic Analysis with Molecular Evolutionary
Genetics Analysis (MEGA) 3.1 software
 Build multiple sequence alignments using
ClustalW
 Build phylogenetic trees
Text Searching at NCBI
 NCBI provides molecular
information and bioinformatics tools to
the scientific community
 GenBank – an archival DNA and protein
sequence database
 RefSeq – a curated DNA and protein
sequence database
 Entrez Gene – a gene centered database
Sequence Similarity and
Homology
 Homologs – sequences that share a common
ancestral sequence
 Paralogs – arise via gene duplication
 Orthologs – arise via a speciation event
 Xenologs – arise via horizontal gene transfer
 Evolutionarily related sequences tend to be
similar.
 Sequence differences reflect the amount
of change that has occurred since the sequences last
shared a common ancestral sequence.
Sequence Alignments
 Sequence Alignment – a process that identifies a
series of characters or character patterns that are in
the same order in both sequences.
 Pairwise Global alignment
 Pairwise Local alignment
 Optimal alignment – an alignment between
sequences in which the number of matching
characters is maximized and the number of
mismatching characters is minimized.
 Quantifying alignments
 Alignment score of the optimal alignment
 Percent identity scores
 Percent similarity scores
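As an illustration of pairwise global alignment and percent identity, here is a minimal sketch of the Needleman-Wunsch dynamic programming method. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for the example, not settings taken from any particular tool, and percent identity is computed here over the full alignment length, which is only one of several conventions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(x, y):
    """Percent of alignment columns holding the same non-gap character."""
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)


if __name__ == "__main__":
    s, x, y = global_align("ACGT", "AGT")
    print(s, x, y, percent_identity(x, y))
```

A local alignment (Smith-Waterman) differs mainly in that cell scores are never allowed to drop below zero and the traceback starts from the highest-scoring cell rather than the corner.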
Sequence Similarity Searching
 Basic Local Alignment Search Tool (BLAST)
 blastp, blastn, blastx, tblastn, and tblastx
 Local alignments are reported
 Expectation Value (E-value) – the number of
alignments an investigator can expect to find
by chance with an alignment score as good as or
better than the alignment score under consideration.
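To give a feel for how the expectation value behaves, the sketch below uses the Karlin-Altschul relation E = m · n · 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The function name and parameters are assumptions for illustration. BLAST itself uses effective (edge-corrected) lengths, so values computed this way will not exactly match a BLAST report.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    query_len (m) and db_len (n) are raw lengths here; BLAST uses
    effective lengths, so treat the result as an approximation.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Same bit score, but a database twice as large doubles the E-value.
    print(expect_value(30, 100, 2**20))
    print(expect_value(30, 100, 2**21))
```

The practical consequences are that a higher bit score lowers the E-value exponentially, and that the same alignment becomes less significant as the database grows.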
Steps to Build a Tree
 Build a multiple sequence alignment of
the data set.
 Analyze the multiple sequence alignment
using either distance based methods or
character based methods.
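For the distance based route, the first step is to reduce the alignment to pairwise distances. The sketch below computes the simplest such measure, the p-distance (proportion of differing sites), for every pair of sequences. The toy alignment, the function names, and the choice to skip columns containing a gap (pairwise deletion) are assumptions for illustration. Programs such as MEGA also offer corrected distances that account for multiple substitutions at the same site.

```python
def p_distance(s1, s2):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return (names, matrix) where matrix[i][j] is the p-distance between sequences i and j."""
    names = list(alignment)
    matrix = [[p_distance(alignment[x], alignment[y]) for y in names] for x in names]
    return names, matrix


if __name__ == "__main__":
    # Hypothetical aligned sequences of equal length.
    toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACG-TCGA"}
    names, matrix = distance_matrix(toy)
    for name, row in zip(names, matrix):
        print(name, [round(v, 3) for v in row])
```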
Molecular Evolutionary
Genetics Analysis (MEGA) 3.1
 Phylogenetic Analysis program
 Constructs multiple sequence alignment using
ClustalW
 Provides tree building methods
 Distance based Methods
 UPGMA
 Neighbor-joining method
 Minimum Evolution
 Character based Method
 Maximum Parsimony
 Provides a great help document!
Multiple Sequence Alignment
 Multiple Sequence Alignment – an alignment
between three or more sequences.
 Finding the optimal alignment is NP-hard
 Programs
 ClustalW – fast, applies a progressive method
 T-Coffee – slower, applies an advanced
progressive method
 Dialign – slow, applies an iterative method
 Combine – combines multiple sequence
alignments
Tree Building Methods
 UPGMA, Neighbor-Joining, Minimum Evolution
 Distance based methods
 Analyze the multiple sequence alignment to
calculate a distance matrix.
 Clustering algorithm analyzes the distance matrix
to determine which sequences should be
clustered.
 Maximum parsimony
 Character based method
 Analyze the multiple sequence alignment to create
a tree whose tree length has been minimized.
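The clustering step of a distance based method can be sketched with UPGMA, the simplest of the three listed. At each round the two closest clusters are merged, and the distance from the new cluster to every other cluster is the size-weighted average of the old distances. The function name, the nested-tuple tree representation, and the toy matrix are assumptions for illustration. Branch lengths are omitted for brevity. UPGMA assumes a constant rate of change along all lineages, which Neighbor-Joining does not.

```python
def upgma(names, dist):
    """Build a tree (nested tuples of names) from a symmetric distance matrix using UPGMA."""
    # clusters maps a cluster id to (subtree, number of leaves in it)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # d maps an ordered pair of cluster ids (smaller id first) to their distance
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        new = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Size-weighted average distance to the merged cluster.
            new[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        # Drop distances that involve the two merged clusters.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


if __name__ == "__main__":
    # Hypothetical distances: A and B are closest, then C, then D.
    names = ["A", "B", "C", "D"]
    dist = [
        [0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 10],
        [10, 10, 10, 0],
    ]
    print(upgma(names, dist))
```

Maximum parsimony works differently: instead of summarizing the alignment as a matrix, it scores candidate trees directly by the number of character changes they require and keeps the tree with the smallest total.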
Tree Reliability
 Bootstrapping – method for assessing
the reliability of trees.
 Steps
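 The columns of the original alignment are
resampled with replacement many times (e.g. 1000).
 For each resampling, a tree is built.
 The trees created from the resampling
iterations are compared to the original
tree; branches recovered in most resampled
trees are considered well supported.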
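The resampling step can be sketched as follows: alignment columns are drawn with replacement until the replicate has the same length as the original. The names `bootstrap_alignment` and `bootstrap_support` are hypothetical, and `build_tree` and `clade_in_tree` are caller-supplied placeholders for whatever tree building method and clade test are in use. This is a sketch of the idea, not how MEGA implements it.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; every sequence keeps its length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, build_tree, clade_in_tree, replicates=1000, seed=0):
    """Fraction of replicate trees for which clade_in_tree(tree) is true.

    build_tree:    function taking an alignment dict and returning a tree
    clade_in_tree: function taking a tree and returning True or False
    """
    rng = random.Random(seed)
    hits = sum(
        bool(clade_in_tree(build_tree(bootstrap_alignment(alignment, rng))))
        for _ in range(replicates)
    )
    return hits / replicates


if __name__ == "__main__":
    # Hypothetical aligned sequences of equal length.
    toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGTTCGA"}
    print(bootstrap_alignment(toy, random.Random(1)))
```

A support value near 1.0 means the grouping appears in almost every replicate tree; a low value means the grouping depends on only a few alignment columns.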
Review
 Acquiring Data Set
 Text searching at the National Center for
Biotechnology Information (NCBI)
 Sequence similarity and homology
 Sequence similarity searching with Basic Local
Alignment Search Tool (BLAST)
 Analyzing Data Set
 Phylogenetic Analysis with Molecular Evolutionary
Genetics Analysis (MEGA) 3.1 software
 Build multiple sequence alignments using
ClustalW
 Build phylogenetic trees
