Biological sequences analysis
A review of two alignment-free methods for sequence comparison
Outline
• Introduction to sequence alignment problem
• Introduction to alignment-free sequence comparison
• An LZ-complexity based alignment method
• A 2D graphical alignment method
• Methods overall comparison
Introduction to sequence alignment
• Goal: determine whether a given sequence is similar to another sequence
• Determine whether a database contains a potentially homologous sequence
• Two alignment types are used: global and local
• The global approach compares one whole sequence with other entire sequences.
• The local method takes a subsequence and attempts to align it to subsequences of other sequences.
• A global alignment compares the two sequences over their entire length:
GCATTACTAATATATTAGTAAATCAGAGTAGTA
||||||||| ||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG
• By contrast, a local alignment starts from a small seed match that can be extended quickly.
• The initial seed for the alignment:
TAT
|||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG
• And now the extended alignment:
TATATATTAGTA
||||||||| ||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG
• How can similarities between genetic sequences be found?
  • Naive methods: compare all possible alignments (extremely slow)
  • Heuristic methods
    • Examples: BLAST, FASTA, …
    • Optimal solution is not guaranteed
    • Trade-off: speed vs. accuracy
  • Dynamic programming methods
    • Examples: Needleman-Wunsch (global), Smith-Waterman (local)
• …faster alternatives?
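To make the cost of the dynamic programming approach concrete, here is a minimal sketch of Needleman-Wunsch global alignment scoring. The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not taken from the slides. The table has (len(a)+1) × (len(b)+1) cells, so time and memory grow with the product of the sequence lengths.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Return the optimal global alignment score of sequences a and b.

    Scoring values are illustrative assumptions, not values from the deck.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]
```

Every cell is filled exactly once, which is what makes the result optimal and also what makes the method slow on long sequences.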
Alignment-free comparison
• Challenge: overcome the inefficiency of traditional alignment-based algorithms
• Alignment-based methods
  • Slow
  • May produce incorrect results on divergent but functionally related sequences
• Alignment-free methods
  • Much faster than alignment-based methods
  • Most methods run in linear time
• Four categories:
  • methods based on k-mer/word frequency,
  • methods based on substrings,
  • methods based on information theory (LZ-complexity based method) and
  • methods based on graphical representation (2D-graphical method)
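As a small illustration of the first category, the sketch below counts overlapping k-mers in linear time and compares two sequences by the Euclidean distance between their count vectors. The function names, the default k = 3, and the choice of distance are assumptions made for this example; the deck does not describe a specific k-mer method.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer count vectors of a and b."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    words = set(counts_a) | set(counts_b)
    # A Counter returns 0 for words it has never seen.
    return sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in words))
```

No alignment is computed: each sequence is scanned once and only the word statistics are compared.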
LZ-complexity based sequence comparison
• Method based on information theory
• Analysis of DNA/protein sequences
• Built upon the LZ-complexity measure
• Dynamic programming algorithm
LZ-complexity
• Complexity measure for finite sequences
• LZ-complexity as entropy rate estimator for finite sequences
• Produces a dictionary of productions for a sequence 𝑆.
• "The proposed complexity measure is related to the number of steps in a self-delimiting production process by which a given sequence is presumed to be generated" (Abraham Lempel and Jacob Ziv, "On the Complexity of Finite Sequences", 1976)
LZ-complexity (production process)
• An $m$-step production process of a finite sequence $S$:
$H(S) = S(1, h_1)\, S(h_1 + 1, h_2) \cdots S(h_{m-1} + 1, h_m)$
• $H(S)$ is called the history of $S$, and $H_i(S) = S(h_{i-1} + 1, h_i)$ is called the $i$-th component of $H(S)$.
• Each component $H_i(S)$ is added to a dictionary
LZ-complexity (algorithm)
Initialize the dictionary
repeat until the sequence has been consumed
• Add the next symbol to the current subsequence.
• If the subsequence is not reproducible from the previous history, add it to the dictionary and increase the index value (the final subsequence is added even if it is reproducible)
The production process inserts a comma (',') into a sequence 𝑆 after the
creation of each new phrase formed by the concatenation of the longest
recognized dictionary phrase and the innovative symbol that follows.
LZ-complexity (Example)
• S = ATGGTCGGTTTC
• The complexity $c(S)$ of the sequence $S$ is $c(S) = 7$
• The history of $S$ is $H(S) = \{A, T, G, GT, C, GGTT, TC\}$

Position  Symbol  Add to dictionary  Index
1         A       A                  1
2         T       T                  2
3         G       G                  3
4         G
5         T       GT                 4
6         C       C                  5
7         G
8         G
9         T
10        T       GGTT               6
11        T
12        C       TC                 7
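The production process can be sketched in a few lines of Python. This is a minimal sketch, assuming that "reproducible" means the current subsequence already occurs somewhere in the text that precedes its last symbol (the exhaustive-history reading of Lempel and Ziv). The function names are hypothetical, and the quadratic substring search is chosen for clarity rather than speed.

```python
def lz_history(s: str) -> list:
    """Split s into the components of its exhaustive history."""
    components = []
    start, n = 0, len(s)
    while start < n:
        end = start + 1
        # Keep extending while s[start:end] can be copied from the text
        # that precedes its last symbol.
        while end <= n and s[start:end] in s[:end - 1]:
            end += 1
        end = min(end, n)  # the final component may itself be reproducible
        components.append(s[start:end])
        start = end
    return components


def lz_complexity(s: str) -> int:
    """Number of components in the exhaustive history of s."""
    return len(lz_history(s))
```

For the example sequence, `lz_history("ATGGTCGGTTTC")` returns `['A', 'T', 'G', 'GT', 'C', 'GGTT', 'TC']`, so the complexity is 7, in agreement with the table.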
LZ-complexity based sequence comparison
• Based on the number of components in the LZ-complexity
decomposition of the DNA sequences.
• Given two sequences S and Q decomposed using the LZ-complexity:
$S = S_1 S_2 \ldots S_k \ldots S_m$
$Q = Q_1 Q_2 \ldots Q_k \ldots Q_n$
• $m$ is the number of fragments of $S$
• $n$ is the number of fragments of $Q$
• Let $\sigma$ be the score function used to build the dynamic programming matrix. It is defined as follows:
$\sigma(S_i, \_) = \sigma(\_, Q_j) = 1$
$\sigma(S_i, Q_j) = 1 - \dfrac{N(S_i, Q_j)}{\max(\mathrm{length}(S_i), \mathrm{length}(Q_j))}$
• where $N(S_i, Q_j)$ is the number of elements that fragments $S_i$ and $Q_j$ have in common.
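A possible reading of the score function is sketched below. It assumes that N counts shared symbols with multiplicity (a multiset intersection), an interpretation that reproduces the values in the deck's example, such as σ(GT, TGA) = 1/3. Representing the gap "_" by an empty string is also an assumption of this sketch.

```python
from collections import Counter


def sigma(s_frag: str, q_frag: str) -> float:
    """Score between two fragments; an empty string stands for the gap '_'."""
    if not s_frag or not q_frag:
        return 1.0  # aligning a fragment against a gap always costs 1
    # Assumed meaning of N: symbols shared by both fragments, with multiplicity.
    common = sum((Counter(s_frag) & Counter(q_frag)).values())
    return 1.0 - common / max(len(s_frag), len(q_frag))
```

Identical fragments score 0, and fragments with no symbol in common score 1, the same as a gap.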
• The sequence similarity matrix $M$ is built using the following formulas, with $M[0, 0] = 0$:
$M[i, 0] = \sum_{k=1}^{i} \sigma(S_k, \_)$
$M[0, j] = \sum_{k=1}^{j} \sigma(\_, Q_k)$
$M[i, j] = \min \begin{cases} M[i-1, j] + \sigma(S_i, \_) \\ M[i-1, j-1] + \sigma(S_i, Q_j) \\ M[i, j-1] + \sigma(\_, Q_j) \end{cases}$
Example
$M[m, n]$ is the similarity distance between sequences $S$ and $Q$

S↓ \ Q→          A      T      G      TGA    ATGC   AT
          0      1      2      4      8      16     32
A         1      0      1      2      3      4      5
T         2      1      0      1      2      3      4
G         4      2      1      0      1      2      3
GT        8      3      2      1      0.333  1.333  2.333
C         16     4      3      2      1.333  1.083  2.083
GGTT      32     5      4      3      2.333  1.833  1.833
TC        64     6      5      4      3.333  2.833  2.333
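The recurrence can be sketched as follows. This is a sketch under stated assumptions: N is taken to be the multiset intersection of the two fragments, the gap is represented by an empty string, and the function names are hypothetical. Border cells follow the summation formulas (cumulative unit gap costs), so they need not match the border values printed in the example table; the interior cells and $M[m, n]$ do.

```python
from collections import Counter


def sigma(s_frag: str, q_frag: str) -> float:
    """Fragment score; an empty string stands for the gap (assumption)."""
    if not s_frag or not q_frag:
        return 1.0
    common = sum((Counter(s_frag) & Counter(q_frag)).values())
    return 1.0 - common / max(len(s_frag), len(q_frag))


def similarity_matrix(s_frags, q_frags):
    """Fill M for two fragment lists; M[m][n] is the distance."""
    m, n = len(s_frags), len(q_frags)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + sigma(s_frags[i - 1], "")
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + sigma("", q_frags[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s, q = s_frags[i - 1], q_frags[j - 1]
            M[i][j] = min(M[i - 1][j] + sigma(s, ""),     # S_i against a gap
                          M[i - 1][j - 1] + sigma(s, q),  # S_i against Q_j
                          M[i][j - 1] + sigma("", q))     # gap against Q_j
    return M


S_FRAGS = ["A", "T", "G", "GT", "C", "GGTT", "TC"]
Q_FRAGS = ["A", "T", "G", "TGA", "ATGC", "AT"]
distance = similarity_matrix(S_FRAGS, Q_FRAGS)[-1][-1]
```

With the fragments from the example, `distance` evaluates to about 2.333, the bottom-right entry of the table. Because the matrix is indexed by fragments rather than by symbols, it is much smaller than a symbol-level alignment matrix.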
Results
• Data set: sequences of the first exon of the β-globin gene of 11 species
• Method:
  • Calculate the similarity degree among the sequences using the proposed method (LZ-complexity + dynamic programming)
  • Arrange all the similarity degrees into a matrix
  • Feed the pairwise distances into the neighbor-joining program in the PHYLIP package
G. Huang et al. (2D-graphical method)
• Method based on graphical representation
• Four vectors correspond to the four nucleotides:
$A \to \left(1, -\frac{\sqrt{3}}{3}\right)$
$T \to \left(1, \frac{\sqrt{3}}{2}\right)$
$G \to \left(1, -\sqrt{5}\right)$
$C \to \left(1, \sqrt{3}\right)$
G. Huang et al. (2D-graphical method)
• DNA sequence can be turned into a graphical curve
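One way to picture the construction is to start at the origin and add the vector of each nucleotide in turn, joining the resulting points into a curve. The sketch below assumes this cumulative construction, and it assumes that the second vector components carry square roots (the radical signs are not legible in the slide text). Both points should be checked against the original paper. The names `ASSUMED_VECTORS` and `dna_curve` are hypothetical.

```python
from math import sqrt

# Assumed nucleotide vectors; the radicals are a reconstruction, not confirmed.
ASSUMED_VECTORS = {
    "A": (1.0, -sqrt(3) / 3),
    "T": (1.0, sqrt(3) / 2),
    "G": (1.0, -sqrt(5)),
    "C": (1.0, sqrt(3)),
}


def dna_curve(seq: str, vectors=ASSUMED_VECTORS):
    """Return the curve points obtained by summing nucleotide vectors."""
    x, y = 0.0, 0.0
    points = [(x, y)]
    for base in seq.upper():
        dx, dy = vectors[base]  # raises KeyError on symbols other than A, T, G, C
        x, y = x + dx, y + dy
        points.append((x, y))
    return points
```

Since every vector has first component 1, the x coordinate is simply the position in the sequence and the y coordinate records the composition seen so far.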
• The graphs give an intuitive view of the (dis)similarity between sequences.
G. Huang et al. (2D-graphical method)
• How to compare sequences?
• Similarity among sequences can be quantified by computing the distance between either vectors or points.
  • Spatial distances
    • Euclidean distance
    • Mahalanobis distance
    • Standard Euclidean distance
  • Cosine similarity
    • Stuart et al. (2002)
Euclidean distance
• Given two vectors $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$, the Euclidean distance is computed as follows:
$ED(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
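The formula translates directly into code. This is a minimal sketch with a hypothetical function name; the vectors are plain Python sequences of numbers.

```python
from math import sqrt


def euclidean_distance(a, b) -> float:
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```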
Mahalanobis distance
• The Mahalanobis distance takes the covariance structure of the data into account. It is defined as follows:
$MD(A, B) = \sqrt{(A - B)\, CV^{-1}\, (A - B)'}$
• $CV$ is the covariance matrix
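A dependency-free sketch of this distance is shown below. Instead of inverting $CV$ explicitly, it solves the linear system $CV\,x = (A - B)$ by Gaussian elimination and then takes $\sqrt{(A - B) \cdot x}$, which gives the same value. The function names are hypothetical, and the covariance matrix is assumed to be supplied by the caller as a list of rows.

```python
from math import sqrt


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in row] + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def mahalanobis_distance(a, b, cov) -> float:
    """sqrt((a - b) * cov^-1 * (a - b)') for a given covariance matrix."""
    diff = [x - y for x, y in zip(a, b)]
    weighted = solve(cov, diff)  # equals cov^-1 * diff
    return sqrt(sum(d * w for d, w in zip(diff, weighted)))
```

With the identity matrix as covariance, the result reduces to the Euclidean distance.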
Standard Euclidean distance
• Standard Euclidean Distance (SED) considers only the variances of the $n$ variables; the covariances between them are ignored.
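A minimal sketch, assuming the usual definition of the standardized Euclidean distance in which each squared difference is divided by the variance of that variable. The slide gives no formula, so the exact form used by the authors is an assumption, as is the function name.

```python
from math import sqrt


def standardized_euclidean(a, b, variances) -> float:
    """Euclidean distance with each coordinate scaled by its variance."""
    if not (len(a) == len(b) == len(variances)):
        raise ValueError("all inputs must have the same length")
    return sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))
```

This equals the Mahalanobis distance when the covariance matrix is diagonal.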
Cosine similarity
• Stuart et al. define a distance using the angle between vectors. It is defined as follows:
$AD(A, B) = \dfrac{A \cdot B}{|A| \times |B|} = \dfrac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\, \sqrt{\sum_{i=1}^{n} b_i^2}}$
$EAD(A, B) = -\ln\left(\dfrac{1 + AD(A, B)}{2}\right)$
• where $AD(A, B)$ is the cosine similarity between $A$ and $B$, and $EAD(A, B)$ represents the evolutionary distance between $A$ and $B$.
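The two formulas can be sketched as below. The function names are hypothetical. Clamping the cosine to [-1, 1] and returning infinity for exactly opposite vectors are choices made for this sketch to keep the logarithm well defined. Zero vectors are not handled and raise `ZeroDivisionError`.

```python
from math import inf, log, sqrt


def cosine_similarity(a, b) -> float:
    """AD(A, B): cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def evolutionary_distance(a, b) -> float:
    """EAD(A, B) = -ln((1 + AD(A, B)) / 2)."""
    # Clamp to absorb floating-point rounding just outside [-1, 1].
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    if cos == -1.0:
        return inf  # opposite vectors: the logarithm diverges
    return -log((1.0 + cos) / 2.0)
```

Parallel vectors give a distance of 0 and orthogonal vectors give ln 2, so the distance grows as the angle between the vectors widens.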
Results
• Two data sets were used
  • a real sequence set
    • Human mitochondrial genome
  • a random sequence set
    • Obtained by applying random mutations to the real sequence set (mutation rates of 1%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100%)
• The Euclidean, SED, Mahalanobis and EAD distances were used
• 𝑑 𝑥 denotes the distance between a sequence and its randomly mutated version.
• The Euclidean distance is more sensitive to the mutation rate than the other three distances.
• 35 mitochondrial genome sequences from different mammals (GenBank database)
• Primate species, including human, ape, gorilla, chimpanzee, etc., are grouped together
• The result agrees with those obtained by Yu et al. (2010) and Raina et al. (2005)
Presented methods comparison
LZ-complexity based algorithm       2D-graphic based algorithm
Dynamic programming algorithm       Graphical algorithm
LZ-complexity measure               Various distances (ED, Mahalanobis, …)
Generic (DNA/proteins)              DNA-specific
Unrooted phylogenetic-tree results  Rooted phylogenetic-tree results
Position  Symbol  Add to dictionary  Index  Rate
1         A       A                  1      1
2         T       T                  2      1
3         G       G                  3      1
4         G
5         T       GT                 4      0.80
..        ..      ..                 ..     ..

$ED(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
Presented methods comparison
LZ-complexity based algorithm 2D-graphic based algorithm
Dynamic programming algorithm Graphical algorithm
LZ-complexity measure Various distances (ED, Mahalanobis,…)
Generic (DNA/proteins) DNA-specific
Unrooted Phylogenetic-tree results Rooted Phylogenetic-tree results
Presented methods comparison
LZ-complexity based algorithm 2D-graphic based algorithm
Dynamic programming algorithm Graphical algorithm
LZ-complexity measure Various distances (ED, Mahalanobis,…)
Generic (DNA/proteins) DNA-specific
Unrooted Phylogenetic-tree results Rooted Phylogenetic-tree results
Biological sequences analysis
Biological sequences analysis
Biological sequences analysis
Biological sequences analysis

More Related Content

What's hot

Protein database
Protein databaseProtein database
Protein database
Khalid Hakeem
 
I- Tasser
I- TasserI- Tasser
I- Tasser
Animesh Kumar
 
Multiple Alignment Sequence using Clustal Omega/ Shumaila Riaz
Multiple Alignment Sequence using Clustal Omega/ Shumaila RiazMultiple Alignment Sequence using Clustal Omega/ Shumaila Riaz
Multiple Alignment Sequence using Clustal Omega/ Shumaila Riaz
ShumailaRiaz6
 
BLAST
BLASTBLAST
BLAST
Rabia W.
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
Zeeshan Hanjra
 
Protein Databases
Protein DatabasesProtein Databases
Introduction to databases.pptx
Introduction to databases.pptxIntroduction to databases.pptx
Introduction to databases.pptx
sworna kumari chithiraivelu
 
Secondary structure prediction
Secondary structure predictionSecondary structure prediction
Secondary structure prediction
samantlalit
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programs
MugdhaSharma11
 
Fasta
FastaFasta
BITS: Basics of sequence analysis
BITS: Basics of sequence analysisBITS: Basics of sequence analysis
BITS: Basics of sequence analysis
BITS
 
BLAST
BLASTBLAST
Swiss prot database
Swiss prot databaseSwiss prot database
Swiss prot database
sagrika chugh
 
Protein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modelingProtein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modeling
Bioinformatics and Computational Biosciences Branch
 
Sequence similarity tools.pptx
Sequence similarity tools.pptxSequence similarity tools.pptx
Sequence similarity tools.pptx
PagudalaSangeetha
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
Afra Fathima
 
Protein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural AlignmentProtein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural Alignment
Saramita De Chakravarti
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
Meghaj Mallick
 
The Gene Ontology & Gene Ontology Annotation resources
The Gene Ontology & Gene Ontology Annotation resourcesThe Gene Ontology & Gene Ontology Annotation resources
The Gene Ontology & Gene Ontology Annotation resources
Melanie Courtot
 

What's hot (20)

Protein database
Protein databaseProtein database
Protein database
 
I- Tasser
I- TasserI- Tasser
I- Tasser
 
Multiple Alignment Sequence using Clustal Omega/ Shumaila Riaz
Multiple Alignment Sequence using Clustal Omega/ Shumaila RiazMultiple Alignment Sequence using Clustal Omega/ Shumaila Riaz
Multiple Alignment Sequence using Clustal Omega/ Shumaila Riaz
 
BLAST
BLASTBLAST
BLAST
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
Protein Databases
Protein DatabasesProtein Databases
Protein Databases
 
Introduction to databases.pptx
Introduction to databases.pptxIntroduction to databases.pptx
Introduction to databases.pptx
 
Secondary structure prediction
Secondary structure predictionSecondary structure prediction
Secondary structure prediction
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programs
 
Fasta
FastaFasta
Fasta
 
BITS: Basics of sequence analysis
BITS: Basics of sequence analysisBITS: Basics of sequence analysis
BITS: Basics of sequence analysis
 
BLAST
BLASTBLAST
BLAST
 
EMBL-EBI
EMBL-EBIEMBL-EBI
EMBL-EBI
 
Swiss prot database
Swiss prot databaseSwiss prot database
Swiss prot database
 
Protein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modelingProtein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modeling
 
Sequence similarity tools.pptx
Sequence similarity tools.pptxSequence similarity tools.pptx
Sequence similarity tools.pptx
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Protein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural AlignmentProtein Structure, Databases and Structural Alignment
Protein Structure, Databases and Structural Alignment
 
Sequence Alignment
Sequence AlignmentSequence Alignment
Sequence Alignment
 
The Gene Ontology & Gene Ontology Annotation resources
The Gene Ontology & Gene Ontology Annotation resourcesThe Gene Ontology & Gene Ontology Annotation resources
The Gene Ontology & Gene Ontology Annotation resources
 

Similar to Biological sequences analysis

Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit sat
ChenYiHuang5
 
The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment
Parinda Rajapaksha
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis
Baivab Nag
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
ammar kareem
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Avijit Famous
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
alizain9604
 
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
ssuser2624f71
 
Sequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfSequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdf
sriaisvariyasundar
 
Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool (BLAST)Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool (BLAST)Asiri Wijesinghe
 
Sequence alignment global vs. local
Sequence alignment  global vs. localSequence alignment  global vs. local
Sequence alignment global vs. local
benazeer fathima
 
Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章
Tsuyoshi Sakama
 
sorting
sortingsorting
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
Sanaym
 
Parallel DNA Sequence Alignment
Parallel DNA Sequence AlignmentParallel DNA Sequence Alignment
Parallel DNA Sequence Alignment
Giuliana Carullo
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptx
nikshaikh786
 
Searching Algorithms
Searching AlgorithmsSearching Algorithms
Searching Algorithms
Afaq Mansoor Khan
 
Bioinformatics lesson
Bioinformatics lessonBioinformatics lesson
Bioinformatics lesson
Daffodil International University
 
Bioinformatics lesson
Bioinformatics lessonBioinformatics lesson
Bioinformatics lesson
Daffodil International University
 

Similar to Biological sequences analysis (20)

Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit sat
 
The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment The Needleman-Wunsch Algorithm for Sequence Alignment
The Needleman-Wunsch Algorithm for Sequence Alignment
 
Cluster Analysis
Cluster Analysis Cluster Analysis
Cluster Analysis
 
Ch06 multalign
Ch06 multalignCh06 multalign
Ch06 multalign
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Ch06 alignment
Ch06 alignmentCh06 alignment
Ch06 alignment
 
lecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadflecture4.ppt Sequence Alignmentaldf sdfsadf
lecture4.ppt Sequence Alignmentaldf sdfsadf
 
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
 
Sequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdfSequence-analysis-pairwise-alignment.pdf
Sequence-analysis-pairwise-alignment.pdf
 
Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool (BLAST)Basic Local Alignment Search Tool (BLAST)
Basic Local Alignment Search Tool (BLAST)
 
Sequence alignment global vs. local
Sequence alignment  global vs. localSequence alignment  global vs. local
Sequence alignment global vs. local
 
Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章Elements of Statistical Learning 読み会 第2章
Elements of Statistical Learning 読み会 第2章
 
sorting
sortingsorting
sorting
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Parallel DNA Sequence Alignment
Parallel DNA Sequence AlignmentParallel DNA Sequence Alignment
Parallel DNA Sequence Alignment
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptx
 
Searching Algorithms
Searching AlgorithmsSearching Algorithms
Searching Algorithms
 
Bioinformatics lesson
Bioinformatics lessonBioinformatics lesson
Bioinformatics lesson
 
Bioinformatics lesson
Bioinformatics lessonBioinformatics lesson
Bioinformatics lesson
 

Recently uploaded

Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
zeex60
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
alishadewangan1
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), EligibilityISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
SciAstra
 

Recently uploaded (20)

Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
nodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptxnodule formation by alisha dewangan.pptx
nodule formation by alisha dewangan.pptx
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), EligibilityISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
ISI 2024: Application Form (Extended), Exam Date (Out), Eligibility
 

Biological sequences analysis

  • 1. Biological sequences analysis A review of two alignment-free methods for sequence comparison
  • 2. Outline • Introduction to sequence alignment problem • Introduction to alignment-free sequence comparison • An LZ-complexity based alignment method • A 2D graphical alignment method • Methods overall comparison
  • 3. Introduction to sequence alignment • Goal: determine if a particular sequence is like another sequence • determine if a database contains a potential homologous sequence.
  • 4. Introduction to sequence alignment • Two alignment types are used: global and local • The global approach compares one whole sequence with other entire sequences. • The local method takes a subset of a sequence and attempts to align it to a subset of other sequences.
  • 5. Introduction to sequence alignment • A global alignment compares the two sequences over their entire length.
      GCATTACTAATATATTAGTAAATCAGAGTAGTA
      ||||||||| ||
      AAGCGAATAATATATTTATACTCAGATTATTGCGCG
  • 6. Introduction to sequence alignment • By contrast, when a local alignment is performed, a small seed is uncovered that can be used to quickly extend the alignment. • The initial seed for the alignment:
      TAT
      |||
      AAGCGAATAATATATTTATACTCAGATTATTGCGCG
  • 7. Introduction to sequence alignment • By contrast, when a local alignment is performed, a small seed is uncovered that can be used to quickly extend the alignment. • And now the extended alignment:
      TATATATTAGTA
      ||||||||| ||
      AAGCGAATAATATATTTATACTCAGATTATTGCGCG
  • 9. Introduction to sequence alignment • How to search for similarities in genetic sequences? • Naive methods: comparing all possible alignments (extremely slow) • Heuristic methods • Examples: BLAST, FASTA, … • Optimal solution is not guaranteed • Tradeoff: speed vs. accuracy • Dynamic programming methods • Examples: Needleman & Wunsch, Smith & Waterman • …faster alternatives?
  • 10. Alignment-free comparison • Challenge: overcome the inefficiency of traditional alignment-based algorithms • Alignment-based methods • Slow • May produce incorrect results when used on more divergent but functionally related sequences
  • 11. Alignment-free comparison • Much faster than alignment-based methods • most methods work in linear time • Four categories: • methods based on k-mer/word frequency, • methods based on substrings, • methods based on information theory (LZ-complexity based method) and • methods based on graphical representation (2D-graphical method)
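As a hedged illustration of the first category (k-mer/word frequency), the sketch below counts overlapping k-mers in a single pass and compares the resulting count vectors. The function names `kmer_counts` and `kmer_distance`, the default k = 3, and the use of the Euclidean distance on raw counts are assumptions made for illustration; none of them comes from the slides.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq (one linear pass)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(s, q, k=3):
    """Euclidean distance between the k-mer count vectors of s and q."""
    cs, cq = kmer_counts(s, k), kmer_counts(q, k)
    words = set(cs) | set(cq)  # a Counter returns 0 for words it has not seen
    return sqrt(sum((cs[w] - cq[w]) ** 2 for w in words))


if __name__ == "__main__":
    print(kmer_counts("ATGATG", 3))
    print(kmer_distance("ATGGTCGGTTTC", "ATGTGAATGCAT"))
```

No alignment is computed here: each sequence is summarised independently, which is why methods of this kind scale roughly linearly with sequence length.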
  • 12. LZ-complexity based sequence comparison • Method based on information theory • Analysis of DNA/protein sequences • Built upon the LZ-complexity measure • Dynamic programming algorithm
  • 13. LZ-complexity • Complexity measure for finite sequences • LZ-complexity as entropy rate estimator for finite sequences • Produces a dictionary of productions for a sequence $S$. • "The proposed complexity measure is related to the number of steps in a self-delimiting production process by which a given sequence is presumed to be generated" (Abraham Lempel and Jacob Ziv, "On the Complexity of Individual Sequences", 1976)
  • 14. LZ-complexity (production process) • An $m$-step production process of a finite sequence $S$: $H(S) = S(1, h_1) \cdot S(h_1 + 1, h_2) \cdots S(h_{m-1} + 1, h_m)$ • $H(S)$ is called the history of $S$, and $H_i(S) = S(h_{i-1} + 1, h_i)$ is called the $i$th component of $H(S)$. • Each component $H_i(S)$ is added to a dictionary
  • 16. LZ-complexity (algorithm) Initialize the dictionary, then repeat until the whole sequence has been consumed: • Add the next symbol to the current subsequence. • If the subsequence is no longer reproducible from the previous history (or the sequence ends), add it to the dictionary and increase the index value. • The production process inserts a comma (',') into the sequence $S$ after the creation of each new phrase, which is formed by concatenating the longest phrase recognized in the previous history with the innovative symbol that follows.
  • 26. LZ-complexity (Example) • S = ATGGTCGGTTTC • The complexity $c(S)$ of the sequence $S$ is $c(S) = 7$ • The history of $S$ is $H(S) = \{A, T, G, GT, C, GGTT, TC\}$
      Position  Symbol  Add to dictionary  Index
      1         A       A                  1
      2         T       T                  2
      3         G       G                  3
      4         G
      5         T       GT                 4
      6         C       C                  5
      7         G
      8         G
      9         T
      10        T       GGTT               6
      11        T
      12        C       TC                 7
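The decomposition in this example can be reproduced with a short sketch. It assumes the usual Lempel-Ziv (1976) rule that a component keeps growing while it can still be copied from the text that precedes its last symbol, and closes as soon as it cannot (or when the sequence ends). The names `lz_history` and `lz_complexity` are illustrative, not taken from the slides.

```python
def lz_history(s):
    """Return the list of LZ components (the history H(S)) of the string s."""
    components = []
    start, n = 0, len(s)
    while start < n:
        end = start + 1
        # Grow the candidate s[start:end] while it is reproducible, i.e. while
        # it already occurs in s[:end - 1] (everything before its last symbol).
        while end <= n and s[start:end] in s[:end - 1]:
            end += 1
        end = min(end, n)  # the final component may stop at the sequence end
        components.append(s[start:end])
        start = end
    return components


def lz_complexity(s):
    """LZ complexity c(S): the number of components in the history."""
    return len(lz_history(s))


if __name__ == "__main__":
    print(lz_history("ATGGTCGGTTTC"))     # ['A', 'T', 'G', 'GT', 'C', 'GGTT', 'TC']
    print(lz_complexity("ATGGTCGGTTTC"))  # 7
```

The substring search makes this sketch quadratic or worse in the sequence length; it favours readability over speed.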
  • 27. LZ-complexity based sequence comparison • Based on the number of components in the LZ-complexity decomposition of the DNA sequences. • Given two sequences $S$ and $Q$ decomposed using the LZ-complexity: $S = S_1 S_2 \dots S_k \dots S_m$ and $Q = Q_1 Q_2 \dots Q_k \dots Q_n$ • $m$ is the number of fragments of $S$ • $n$ is the number of fragments of $Q$
  • 28. LZ-complexity based sequence comparison • Let $\sigma$ be the score function used to build the dynamic programming matrix. It is defined as follows: $\sigma(S_i, \_) = \sigma(\_, Q_j) = 1$ and $\sigma(S_i, Q_j) = 1 - \dfrac{N(S_i, Q_j)}{\max(\mathrm{length}(S_i), \mathrm{length}(Q_j))}$ • where $N(S_i, Q_j)$ is the number of elements that fragments $S_i$ and $Q_j$ have in common.
  • 29. LZ-complexity based sequence comparison • The sequence similarity matrix $M$ is built using the following formulas: $M[i, 0] = \sum_{k=1}^{i} \sigma(S_k, \_)$, $M[0, j] = \sum_{k=1}^{j} \sigma(\_, Q_k)$, $M[i, j] = \min\{\, M[i-1, j] + \sigma(S_i, \_),\; M[i-1, j-1] + \sigma(S_i, Q_j),\; M[i, j-1] + \sigma(\_, Q_j) \,\}$ • Each cell $M[i, j]$ depends only on its three neighbours $M[i-1, j-1]$, $M[i-1, j]$ and $M[i, j-1]$.
  • 31. Example • $M[m, n]$ is the similarity distance between sequences $S$ and $Q$
      S↓ / Q→          A      T      G      TGA    ATGC   AT
                0      1      2      4      8      16     32
      A         1      0      1      2      3      4      5
      T         2      1      0      1      2      3      4
      G         4      2      1      0      1      2      3
      GT        8      3      2      1      0.333  1.333  2.333
      C         16     4      3      2      1.333  1.083  2.083
      GGTT      32     5      4      3      2.333  1.833  1.833
      TC        64     6      5      4      3.333  2.833  2.333
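A minimal sketch of the recurrence can be checked against this example. Two points are assumptions rather than statements from the slides: $N(S_i, Q_j)$ is counted as the number of symbols the two fragments share regardless of position (a multiset intersection), and the border row and column are filled from the summation formulas, which gives 1, 2, 3, … The names `sigma`, `GAP` and `lz_distance_matrix` are illustrative.

```python
from collections import Counter

GAP = 1.0  # sigma(S_i, _) = sigma(_, Q_j) = 1


def sigma(a, b):
    """Score of pairing fragment a with fragment b (0 means identical content).

    Assumption: N(a, b) is the size of the multiset intersection of symbols.
    """
    shared = sum((Counter(a) & Counter(b)).values())
    return 1.0 - shared / max(len(a), len(b))


def lz_distance_matrix(s_frags, q_frags):
    """Fill the dynamic programming matrix M over two lists of LZ fragments."""
    m, n = len(s_frags), len(q_frags)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: fragments of S vs. gaps
        M[i][0] = M[i - 1][0] + GAP
    for j in range(1, n + 1):          # first row: gaps vs. fragments of Q
        M[0][j] = M[0][j - 1] + GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(
                M[i - 1][j] + GAP,
                M[i - 1][j - 1] + sigma(s_frags[i - 1], q_frags[j - 1]),
                M[i][j - 1] + GAP,
            )
    return M


if __name__ == "__main__":
    S = ["A", "T", "G", "GT", "C", "GGTT", "TC"]
    Q = ["A", "T", "G", "TGA", "ATGC", "AT"]
    M = lz_distance_matrix(S, Q)
    print(round(M[len(S)][len(Q)], 3))
```

For the fragments of the example, the last cell $M[m, n]$ evaluates to about 2.333, the value in the bottom-right corner of the table.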
  • 32. Results • Data set: sequences of the first exon of the 𝛽-globin gene of 11 species • Method: • Calculate the similarity degree among the sequences using the proposed method (LZ-complexity + dynamic programming) • Arrange all the similarity degrees into a matrix • Feed the pairwise distances into a neighbor-joining program in the PHYLIP package
  • 35. G. Huang et al. (2D-graphical method) • Method based on graphical representation • Four vectors correspond to four groups of nucleotides: 𝐴 → (1, − 3 3) 𝑇 → (1, 3 2) 𝐺 → (1, − 5) 𝐶 → (1, 3)
  • 36. G. Huang et al. (2D-graphical method) • A DNA sequence can be turned into a graphical curve
  • 37. G. Huang et al. (2D-graphical method) • The graphs intuitively show the (dis)similarity between sequences.
  • 39. G. Huang et al. (2D-graphical method) • How to compare sequences? • Similarity among sequences can be quantified by computing distance between either vectors or points. • Spatial distances • Euclidean distance • Mahalanobis distance • Standard Euclidean distance • Cosine similarity • Stuart et al. (2002)
  • 40. Euclidean distance • Given two vectors $A = \{a_1, a_2, \dots, a_n\}$ and $B = \{b_1, b_2, \dots, b_n\}$, the Euclidean distance is computed as follows: $ED(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
  • 41. Mahalanobis distance • The Mahalanobis distance takes the covariance of the data into account. It is defined as follows: $MD(A, B) = \sqrt{(A - B)\, CV^{-1} (A - B)'}$ • $CV$ is the covariance matrix
  • 42. Standard Euclidean distance • Standard Euclidean Distance (SED) considers merely the variance of n variables.
  • 43. Cosine similarity • Stuart et al. define a distance using the angle between vectors. It is defined as follows: $AD(A, B) = \dfrac{A \cdot B}{|A| \times |B|} = \dfrac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\, \sqrt{\sum_{i=1}^{n} b_i^2}}$ and $EAD(A, B) = -\ln\left(\dfrac{1 + AD(A, B)}{2}\right)$ • where $AD(A, B)$ is the cosine similarity between $A$ and $B$, and $EAD(A, B)$ represents the evolutionary distance between $A$ and $B$.
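The Euclidean distance, the cosine similarity and the evolutionary distance derived from it can be sketched in a few lines of plain Python. The function names are illustrative, and the input vectors are assumed to be equal-length numeric sequences with non-zero norm. The Mahalanobis distance is left out because it needs a covariance matrix estimated from a whole data set.

```python
from math import log, sqrt


def euclidean(a, b):
    """ED(A, B): square root of the summed squared coordinate differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """AD(A, B): dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def evolutionary_distance(a, b):
    """EAD(A, B) = -ln((1 + AD(A, B)) / 2)."""
    return -log((1 + cosine_similarity(a, b)) / 2)


if __name__ == "__main__":
    print(euclidean([0, 0], [3, 4]))              # 5.0
    print(evolutionary_distance([1, 0], [0, 1]))  # ln 2 for orthogonal vectors
```

Vectors pointing in the same direction have cosine similarity 1 and therefore an evolutionary distance of 0, whatever their lengths, whereas the Euclidean distance also reacts to differences in magnitude.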
  • 44. Results • Two data sets were used • a set of real sequences • Human mitochondrial genome • a set of random sequences • Obtained by applying random mutations to the real sequences (1%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% mutation rates) • Euclidean, SED, Mahalanobis and EAD distances were used
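The slides do not say how the random mutations were generated, so the following is only one plausible sketch. It assumes point substitutions only (no insertions or deletions), mutates exactly the requested fraction of positions, and always substitutes a different nucleotide. The function name `mutate` and its parameters are hypothetical.

```python
import random


def mutate(seq, rate, rng=None):
    """Return a copy of seq with a fraction `rate` of its positions substituted.

    Assumptions (not from the slides): substitutions only, positions chosen
    uniformly without replacement, replacement always differs from the original.
    """
    rng = rng or random.Random()
    n_mut = round(rate * len(seq))
    positions = rng.sample(range(len(seq)), n_mut)
    out = list(seq)
    for p in positions:
        out[p] = rng.choice([c for c in "ACGT" if c != out[p]])
    return "".join(out)


if __name__ == "__main__":
    original = "ATGGTCGGTTTC"
    for rate in (0.01, 0.05, 0.10, 0.50, 1.00):
        print(rate, mutate(original, rate, random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing how the different distances respond to the same mutated sequences.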
  • 45. Results • $d_x$ denotes the distance between a sequence and its randomly mutated version. • The Euclidean distance is more sensitive to the mutation rate than the other three distances.
  • 46. Results • 35 mitochondrial genome sequences from different mammals (GenBank database) • Primate species, including human, ape, gorilla, chimpanzees, etc., are grouped together • The result agrees with those obtained by Yu et al. (2010) and Raina et al. (2005)
  • 48. Presented methods comparison
      LZ-complexity based algorithm        2D-graphic based algorithm
      Dynamic programming algorithm        Graphical algorithm
      LZ-complexity measure                Various distances (ED, Mahalanobis, …)
      Generic (DNA/proteins)               DNA-specific
      Unrooted phylogenetic-tree results   Rooted phylogenetic-tree results
  • 50. Presented methods comparison (same table, shown with two excerpts)
      Position  Symbol  Add to dictionary  Index  Rate
      1         A       A                  1      1
      2         T       T                  2      1
      3         G       G                  3      1
      4         G
      5         T       GT                 4      0.80
      ..        ..      ..                 ..     ..
      $ED(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$

Editor's Notes

  1. PHYLIP is a free package of programs for inferring phylogenies