Reconstruction of a phylogenetic tree
using maximum-likelihood methods
PhyML (in a nutshell)
Note: Slides are still under revision
Steps
• Collect homologous sequences.
• Multiple sequence alignment.
• Manual curation of the multiple sequence alignment.
• Feeding the MSA to programs that assess the pattern of
substitution across the sites of the MSA
(ProtTest for protein and jModelTest for DNA alignments).
• Selecting an appropriate substitution model.
• Feeding the MSA, starting tree (e.g., those obtained with
Neighbour-joining method) and substitution model as well
as bootstrap properties to PhyML.
• Obtain the tree and cross-check bootstrap values, branch
lengths and general resolution.
• Remove rogue taxa and redo the entire process until a
satisfactory tree is constructed.
Selection of sequences for phylogenetic tree
Purpose of the tree
1. Genealogy: evolution of a gene or gene family irrespective of
speciation (called a gene tree).
2. Phylogeny: evolution of a gene or gene family in the context of
speciation (called a species tree).
Homologues: Genes derived from a common ancestor.
Orthologues: Homologues that are separated from one another by
speciation (i.e., after speciation occurs, the same copy of the gene
evolves under the different constraints that are faced by the two
species).
Paralogues: Homologues that are separated from each other by
gene/genome duplication (the copies then diverge within the same lineage).
Selecting sequences
• Similar sequences with a considerably low e-value in BLAST can
in general be taken to be homologous.
• <40% amino acid similarity = higher likelihood that the similarity
arose by chance and is not necessarily due to homology
• ~40% amino acid similarity = twilight zone for homology (may
or may not be homologous)
• ≥60% amino acid similarity = homology inferred
(~80% or higher similarity for DNA sequences).
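As a rough illustration of these cut-offs, the sketch below computes percent identity between two already-aligned sequences and maps it onto the three bands. It is a simplification: it uses identity rather than similarity (no substitution matrix), and the function names and the treatment of the 40–60% range as "uncertain" are assumptions made for the example, not part of the slides.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def homology_call(pct: float) -> str:
    """Map a percentage onto the rule-of-thumb bands (40-60% handling is an assumption)."""
    if pct >= 60.0:
        return "homology inferred"
    if pct < 40.0:
        return "possibly chance similarity"
    return "twilight zone / uncertain"


if __name__ == "__main__":
    pct = percent_identity("MKT-LLVA", "MKTALIVA")
    print(round(pct, 1), homology_call(pct))
```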
• Perform BLAST of the new sequence.
• Note the hits obtained and their e-values.
• Follow the sequences down the list with increasing e-values until the
e-value suddenly jumps by several orders of magnitude (about 3 or more).
E.g., an e-value of 1e-10 means that about 1 × 10^-10 hits of this quality
are expected to occur by chance in a database of this size, so the
similarity is very unlikely to be a by-chance occurrence and is more likely
due to homology. A sudden jump from 1e-10 to 1e-5 in the BLAST result list
may indicate that homology is limited to the sequences with the lower e-values.
(Note: the e-value depends on the size of the sequence database. For the
same alignment score, a larger database gives a higher e-value.)
• Note the annotation or characterization of the proteins encoded as well
as the % similarity and sequence coverage.
• Also note the organisms from which it is derived
• Select sequences with considerable coverage and similarity for multiple
sequence alignment.
• The choice of sequence can be based on species of origin and their
relatedness or on special activities and multiple domain structures
depending on what basis the phylogeny is to be re-constructed.
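The "follow the list until the e-value jumps" rule can be sketched as a small helper. This is only an illustration: the function name, the `min_jump` parameter (in orders of magnitude) and the example e-values are all hypothetical, and the list is assumed to be sorted in increasing order as BLAST reports it.

```python
import math


def hits_before_jump(evalues, min_jump=3.0):
    """Return how many leading hits to keep before the first large e-value jump.

    evalues  -- e-values sorted in increasing order (best hit first)
    min_jump -- jump size, in orders of magnitude, treated as a break (assumed value)
    """
    # BLAST prints 0.0 for extremely small e-values; clamp so log10 is defined.
    logs = [math.log10(max(e, 1e-300)) for e in evalues]
    for i in range(1, len(logs)):
        if logs[i] - logs[i - 1] >= min_jump:
            return i
    return len(evalues)


if __name__ == "__main__":
    example = [1e-12, 1e-11, 1e-10, 1e-5, 1e-4]  # made-up values
    keep = hits_before_jump(example)
    print("keep the first", keep, "hits:", example[:keep])
```

The cut-off is only a starting point; annotation, coverage and organism of origin still need to be checked by eye as described above.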
MSA: Multiple Sequence Alignment
Different programs, e.g., CLUSTAL, DIALIGN, MUSCLE, MAFFT.
THEORETICALLY, ANY SEQUENCE CAN BE ALIGNED TO ANY OTHER SEQUENCE.
WHETHER IT MAKES SENSE OR NOT IS A DIFFERENT ISSUE.
CLUSTAL (CLUSTALW2, X): ClustalW2 builds the MSA progressively. It first aligns
every pair of sequences by dynamic programming, which stepwise finds the
highest-scoring alignment by adding scores for matches at each position and
subtracting penalties for mismatches and gaps. The pairwise scores form a distance
matrix, from which a guide tree is built, and the sequences are then added to the
growing MSA in the order given by the guide tree, most similar first. (More
information is available online.) Aligning progressively, rather than considering all
sequences at once, greatly reduces the time required for analysis.
DIALIGN: DIALIGN does not use gap penalties and can thus be used for more
accurate alignment of very divergent sequences that suffer large alignment gaps.
MUSCLE: MUSCLE (MUltiple Sequence Comparison by Log-Expectation) relies on
iterative methods that build an initial progressive alignment and then repeatedly
re-align subsets of the sequences to refine the MSA, producing more accurate
alignments in shorter time frames.
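The dynamic-programming step mentioned for CLUSTAL can be illustrated with a minimal global (Needleman-Wunsch style) pairwise scorer. This is a teaching sketch and not the ClustalW2 implementation: the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions, and a single linear gap penalty is used instead of separate opening and extension penalties.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap      # gap in b
            left = cur[j - 1] + gap  # gap in a
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(nw_score("ACGT", "ACGT"))
    print(nw_score("ACGT", "AGT"))
```

In a progressive aligner, scores like these for every pair of sequences are what feed the distance matrix and guide tree.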
CLUSTAL (CLUSTALX):
• Feed the sequences in FASTA format (copy and paste into the applet or attach a
plain-text file {*.txt}).
E.g.,
>(name of the 1st sequence)
Agtgatagatag…………
>(name of the 2nd sequence)
Gatagatcgctgatcgctc…..
• Run with the default settings.
• Analyze the result:
Gaps are frequent: change the settings so that the gap
opening penalty is high, e.g., increase it from the default value
of 10 to 15, 20, 25, 30.
Gaps are long but less frequent: change the settings so that the
gap extension penalty is high, e.g., increase it from the default
value of 1 to 2, 3, 4, 5.
No gaps but many mismatches: relax the gap opening penalty (5,
6, 7) and/or the gap extension penalty (0.1, 0.2, 0.4, 0.5) so
that indels can occur in the data set for a better match.
REDO THE MSA UNTIL THE ALIGNMENT IMPROVES.
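To decide which penalty to adjust, it can help to put numbers on "frequent" versus "long" gaps. The sketch below counts gap runs and their mean length in an alignment given as a list of equal-length strings; the function name is hypothetical and no thresholds are implied, the numbers are only meant to be compared between successive runs.

```python
import re


def gap_stats(aligned):
    """Return (number of gap runs, mean gap-run length) over all aligned sequences."""
    runs = [len(m.group()) for seq in aligned for m in re.finditer(r"-+", seq)]
    if not runs:
        return 0, 0.0
    return len(runs), sum(runs) / len(runs)


if __name__ == "__main__":
    msa = ["AC--GT", "A-CGGT", "ACGGGT"]  # toy alignment
    n_runs, mean_len = gap_stats(msa)
    print("gap runs:", n_runs, "mean length:", mean_len)
```

Many short runs point towards raising the gap opening penalty, while a few long runs point towards raising the gap extension penalty, as described above.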
Manual curation of MSA
• Involves expert curation, usually of the placement of alignment gaps
within the sequence alignment. This is best understood on a
case-by-case basis.
• Involves the removal of rogue taxa, i.e., sequences that do not fit in
the current MSA due to a disproportionate occurrence of mismatches and
gaps. Usually they can be identified after the first tree is made, when the
bootstrap values and/or branch lengths of particular lineages are
questionable. (Appropriate software is available.)
• The larger the sequence set, the higher the accuracy of the tree, but the more
time-consuming tree construction by maximum likelihood (ML) becomes.
• The more diverse the sequence set, the more erroneous the tree may be, since it
would be an approximation. Hence closely similar representative
sequences from each ordered data set need to be selected. For example,
when dealing with small-molecule methyltransferases one may take a few
close relatives of O-, N- and C-methyltransferases for analysis, since these
have considerable phylogenetic homology.
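A simple first screen for sequences that fit the MSA poorly is the fraction of gap characters per sequence. The sketch below is only a heuristic for shortlisting candidates before the tree-based checks described above; the function names and the `max_gap_fraction` default of 0.5 are arbitrary assumptions, and dedicated rogue-taxon tools work on trees rather than on gap counts.

```python
def gap_fractions(alignment):
    """Map each sequence name to the fraction of its aligned positions that are gaps."""
    return {name: seq.count("-") / len(seq) for name, seq in alignment.items() if seq}


def flag_gappy(alignment, max_gap_fraction=0.5):
    """Names of sequences whose gap fraction exceeds the (assumed) threshold."""
    fractions = gap_fractions(alignment)
    return sorted(name for name, frac in fractions.items() if frac > max_gap_fraction)


if __name__ == "__main__":
    msa = {"seqA": "ACGTACGT", "seqB": "AC--ACGT", "seqC": "A------T"}  # toy data
    print(gap_fractions(msa))
    print("candidates to inspect:", flag_gappy(msa))
```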
Substitution model
• The curated MSA can be given as input to programs like jModelTest for DNA and
ProtTest for proteins to assess the pattern of substitution across the sites in the MSA.
Based on this pattern, a ranked list of appropriate substitution models for the analysis is
calculated. For example, the simplest model, Jukes-Cantor (JC), says that each base of
DNA is substituted by any other base at an equal rate during evolution. Though this is
unrealistic in practice, the selected sequences might be expected to approximately follow
this rate, and thus JC can be used for analysis in PhyML. The Kimura model says that
transitions (Ts) (purine to purine and pyrimidine to pyrimidine changes) and
transversions (Tv) (purine to pyrimidine or vice versa) occur at different rates.
• jModelTest considers 22 base DNA substitution models (11 substitution schemes, each
with equal or unequal base frequencies). Each can be run plain, with +I, with +G, or with
+I+G, making a total of 22*4=88 candidate models for DNA substitution.
• +I: refers to the proportion of invariable sites, i.e., the fraction of sites in the alignment
that are assumed never to change.
Inclusion of this parameter keeps sites that cannot change from being counted as evidence
of slow evolution, which would otherwise bias the estimated distances between sequences.
• +G: refers to a gamma distribution of substitution rates across sites (the gamma
distribution is a shape used to describe how rates vary from slowly to rapidly evolving sites).
• +Y (as labelled in these slides): refers to accounting for the Ts/Tv ratio (incorporated
because of the differences observed between transition and transversion substitutions).
e.g., an MSA can follow a JC model, or JC+I, or JC+G, or JC+I+G
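The difference between the Jukes-Cantor and Kimura two-parameter models can be made concrete with their standard pairwise distance corrections. The sketch below counts transitions and transversions between two aligned DNA sequences and applies both formulas; the function names and example sequences are made up for illustration, and this is not how PhyML computes likelihoods.

```python
import math

PURINES = {"A", "G"}


def ts_tv_proportions(a: str, b: str):
    """Proportions of transitions (P) and transversions (Q) over ungapped A/C/G/T columns."""
    ts = tv = n = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        n += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    if n == 0:
        raise ValueError("no comparable sites")
    return ts / n, tv / n


def jc69_distance(p: float) -> float:
    """Jukes-Cantor distance from the proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k80_distance(P: float, Q: float) -> float:
    """Kimura two-parameter distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


if __name__ == "__main__":
    P, Q = ts_tv_proportions("AAAAAAAAAA", "AGAAAAAAAC")
    print("JC69:", jc69_distance(P + Q), "K80:", k80_distance(P, Q))
```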
Substitution model
• The decision on which substitution model to use depends on three statistical
criteria incorporated in both jModelTest and ProtTest: the Akaike
Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the Akaike
Information Criterion corrected for small samples (AICc).
• The models with the lowest AIC and BIC scores are usually selected as
appropriate substitution models for phylogenetic estimation.
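The three criteria are simple functions of a model's maximized log-likelihood, its number of free parameters and the sample size. The sketch below uses the standard formulas; the model names, log-likelihoods and parameter counts in the example are invented for illustration, and taking the alignment length as the sample size is a common convention assumed here.

```python
import math


def aic(lnL: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2 lnL."""
    return 2.0 * k - 2.0 * lnL


def aicc(lnL: float, k: int, n: int) -> float:
    """AIC with the small-sample correction (requires n > k + 1)."""
    return aic(lnL, k) + (2.0 * k * (k + 1)) / (n - k - 1)


def bic(lnL: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 lnL."""
    return k * math.log(n) - 2.0 * lnL


def best_model(fits, n):
    """Pick the model name with the lowest BIC; fits maps name -> (lnL, k)."""
    return min(fits, key=lambda name: bic(fits[name][0], fits[name][1], n))


if __name__ == "__main__":
    fits = {"JC": (-1050.0, 10), "K80": (-1000.0, 11), "GTR+G": (-998.0, 19)}  # made-up fits
    for name, (lnL, k) in fits.items():
        print(name, round(aic(lnL, k), 2), round(bic(lnL, k, 500), 2))
    print("lowest BIC:", best_model(fits, 500))
```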
Phylogeny
PhyML at present incorporates analysis using 32 substitution models for
DNA.
After supplying all the inputs (the MSA, the substitution model and the +I/
+G/+Y parameter options), tree building can be carried out.
PhyML requires a starting tree for building a phylogenetic tree. This can be
user-defined; if none is available, PhyML can be instructed to construct a
Neighbour-Joining (BioNJ) starting tree on its own.
The tree can be improved by selecting topology search options like SPR + NNI,
so that better topologies and branch lengths can be found.
Finally, bootstrapping with 1000 pseudoreplicates is chosen to assess the
support for the branching topology.
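For command-line use, the choices above map onto PhyML options roughly as in the sketch below, which only assembles the argument list and does not run anything. The flags shown (`-i`, `-d`, `-m`, `-v`, `-a`, `-s`, `-b`, `-u`) are taken from the PhyML 3 manual as best recalled and should be checked against `phyml --help` for the installed version; the executable name `phyml`, the helper name and the file names are assumptions.

```python
def build_phyml_command(alignment, model="GTR", bootstrap=1000,
                        search="BEST", user_tree=None):
    """Assemble a PhyML argument list for a DNA alignment in PHYLIP format."""
    cmd = [
        "phyml",            # executable name may differ per installation (assumption)
        "-i", alignment,    # input alignment
        "-d", "nt",         # nucleotide data
        "-m", model,        # substitution model chosen with jModelTest
        "-v", "e",          # +I: estimate the proportion of invariable sites
        "-a", "e",          # +G: estimate the gamma shape parameter
        "-s", search,       # topology search: NNI, SPR or BEST (best of both)
        "-b", str(bootstrap),  # number of bootstrap pseudoreplicates
    ]
    if user_tree is not None:
        cmd += ["-u", user_tree]  # user-defined starting tree instead of BioNJ
    return cmd


if __name__ == "__main__":
    print(" ".join(build_phyml_command("alignment.phy", bootstrap=100)))
```

The list can be passed to `subprocess.run` once the options have been verified for the local installation.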
Bootstrapping
Bootstrapping makes the program repeat the same
tree building on pseudoreplicates of the alignment,
each made by resampling the alignment columns with
replacement, and then calculates how many times per hundred
pseudoreplicates a branch appears in the same
topology.
A bootstrap value greater than 70% is in general considered significant.
The higher the number of pseudoreplicates, the more
precise the support estimates are.
1000 bootstrap pseudoreplicates are preferable, but when
time is limiting, 100 pseudoreplicates
also suffice.
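The resampling step can be sketched in a few lines: each pseudoreplicate draws alignment columns at random with replacement until it has as many columns as the original. The function names are hypothetical, the tree building on each pseudoreplicate is left out, and `support_percent` simply assumes a list recording whether a given branch was recovered in each replicate.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Build one pseudoreplicate by sampling alignment columns with replacement."""
    length = len(aligned[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in aligned]


def support_percent(branch_found):
    """Bootstrap support: percentage of pseudoreplicates in which the branch appears."""
    return 100.0 * sum(branch_found) / len(branch_found)


if __name__ == "__main__":
    msa = ["ACGTAC", "ACGAAC", "TCGTAA"]  # toy alignment
    rng = random.Random(42)  # fixed seed so the example is reproducible
    for _ in range(3):
        print(bootstrap_alignment(msa, rng))
    print(support_percent([True, True, False, True]))
```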
Re-construction
• Once the tree is generated, it is broadly examined for
accuracy using the bootstrap value of each branch and by looking for
disproportionate branch lengths.
• In case of faulty trees, corrections need to be made for both aspects.
• If the MSA has been curated properly, one might need to remove rogue
taxa (taxa that are problematic for the tree topology or branch
lengths) using available software.
The entire process, starting from the search for optimal substitution models,
may need to be repeated.
• If no rogue taxa can be identified, reducing the overall
sequence diversity could also be tried, so that only the more relevant
sequences are included in the MSA.
• The NJ starting-tree option can also be changed to a user-defined tree option.
• The tree construction is repeated over a number of cycles until an
appropriate tree is generated.
