Biological sequences analysis
A review of two alignment-free methods for sequence comparison
Outline
• Introduction to sequence alignment problem
• Introduction to alignment-free sequence comparison
• An LZ-complexity based alignment method
• A 2D graphical alignment method
• Methods overall comparison
Introduction to sequence alignment
• Goal: determine whether a particular sequence is similar to another sequence
• Determine whether a database contains a potentially homologous sequence.
Introduction to sequence alignment
• Two alignment types are used: global and local
• The global approach compares one whole sequence with other entire
sequences.
• The local method takes a subset of a sequence and attempts to align it to
subsets of other sequences.
Introduction to sequence alignment
• A global alignment looks for similarity over the entire length of
the two sequences involved.
GCATTACTAATATATTAGTAAATCAGAGTAGTA
       |||||||||  ||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG
Introduction to sequence alignment
• By contrast, when a local alignment is performed, a small seed is
uncovered that can be used to quickly extend the alignment.
• The initial seed for the alignment:
TAT
|||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG
• And now the extended alignment:
TATATATTAGTA
||||||||| ||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG
Introduction to sequence alignment
• How can similarities between genetic sequences be found?
• Naive methods: compare all possible alignments (extremely slow)
• Heuristic methods
  • Examples: BLAST, FASTA, …
  • An optimal solution is not guaranteed
  • Trade-off: speed vs. accuracy
• Dynamic programming methods
  • Examples: Needleman & Wunsch, Smith & Waterman
• …faster alternatives?
Alignment-free comparison
• Challenge: overcome the inefficiency of traditional alignment-based
algorithms
• Alignment-based methods
  • Are slow
  • May produce incorrect results when applied to divergent but functionally
    related sequences
Alignment-free comparison
• Much faster than alignment-based methods
• most methods work in linear time
• Four categories:
• methods based on k-mer/word frequency,
• methods based on substrings,
• methods based on information theory (LZ-complexity based method) and
• methods based on graphical representation (2D-graphical method)
LZ-complexity based sequence comparison
• Method based on information theory
• Analysis of DNA/protein sequences
• Built upon the LZ-complexity measure
• Dynamic programming algorithm
LZ-complexity
• Complexity measure for finite sequences
• LZ-complexity as entropy rate estimator for finite sequences
• Produces a dictionary of productions for a sequence 𝑆.
• "The proposed complexity measure is related to the number of steps
in a self-delimiting production process by which a given sequence is
presumed to be generated" (Abraham Lempel and Jacob Ziv, "On the Complexity of
Individual Sequences", 1976)
LZ-complexity (production process)
• m-step production process of a finite sequence S
$$H(S) = S(1, h_1)\, S(h_1 + 1, h_2) \cdots S(h_{m-1} + 1, h_m)$$
• $H(S)$ is called the history of $S$, and $H_i(S) = S(h_{i-1} + 1, h_i)$ is called the
ith component of $H(S)$.
• Each component $H_i(S)$ is added to a dictionary
LZ-complexity (algorithm)
Initialize the dictionary
repeat until the whole sequence has been consumed
• Add the next symbol to the current subsequence.
• If the subsequence is not reproducible from the previous history, add it to the
dictionary, increase the index value and start a new subsequence
The production process inserts a comma (',') into a sequence 𝑆 after the
creation of each new phrase formed by the concatenation of the longest
recognized dictionary phrase and the innovative symbol that follows.
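The production process can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `lz_history` is hypothetical, and "reproducible from the previous history" is interpreted here as "occurs as a substring of the text that precedes the candidate's last symbol", which is an assumption in the spirit of Lempel and Ziv (1976).

```python
def lz_history(s: str) -> list[str]:
    """Split s into LZ components (hypothetical helper, see assumptions above)."""
    components = []
    start = 0                      # index where the current component begins
    n = len(s)
    while start < n:
        end = start + 1            # the candidate component is s[start:end]
        # Grow the candidate while it can still be copied from the text that
        # precedes its last symbol (overlapping copies are allowed).
        while end <= n and s[start:end] in s[:end - 1]:
            end += 1
        end = min(end, n)          # the last component may end with the sequence
        components.append(s[start:end])
        start = end
    return components


history = lz_history("ATGGTCGGTTTC")
print(history, len(history))
```

The length of the returned list plays the role of the complexity c(S); a naive substring search like this one is quadratic or worse, so it is meant for illustration only.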
LZ-complexity (Example)
• S = ATGGTCGGTTTC
• The complexity c(S) of the sequence S is
  • c(S) = 7
• The history of S is
  • H(S) = {A, T, G, GT, C, GGTT, TC}
Position  Symbol  Add to dictionary  Index
1         A       A                  1
2         T       T                  2
3         G       G                  3
4         G
5         T       GT                 4
6         C       C                  5
7         G
8         G
9         T
10        T       GGTT               6
11        T
12        C       TC                 7
LZ-complexity based sequence comparison
• Based on the number of components in the LZ-complexity
decomposition of the DNA sequences.
• Given two sequences S and Q decomposed using the LZ-complexity:
$$S = S_1 S_2 \ldots S_k \ldots S_m$$
$$Q = Q_1 Q_2 \ldots Q_k \ldots Q_n$$
• m is the number of fragments of S
• n is the number of fragments of Q
LZ-complexity based sequence comparison
• Let σ be a score function used to build the dynamic programming
matrix. It is defined as follows:
$$\sigma(S_i, \_) = \sigma(\_, Q_j) = 1$$
$$\sigma(S_i, Q_j) = 1 - \frac{N(S_i, Q_j)}{\max(\mathrm{length}(S_i), \mathrm{length}(Q_j))}$$
• where $N(S_i, Q_j)$ is the number of elements that fragments $S_i$ and $Q_j$
have in common.
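As a small worked instance of the score function, assume that N counts the symbols the two fragments share (an interpretation, not stated explicitly on the slide). For the fragments GT and TGA, both G and T occur in each fragment, so N = 2, and the longer fragment has length 3:

$$\sigma(GT, TGA) = 1 - \frac{2}{\max(2, 3)} = 1 - \frac{2}{3} \approx 0.333$$

Identical fragments score 0, and fragments with no symbol in common score 1, the same cost as a gap.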
LZ-complexity based sequence comparison
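• The sequence similarity matrix M is built using the following
formulas:
$$M[i, 0] = \sum_{k=1}^{i} \sigma(S_k, \_)$$
$$M[0, j] = \sum_{k=1}^{j} \sigma(\_, Q_k)$$
$$M[i, j] = \min \begin{cases} M[i-1, j] + \sigma(S_i, \_) \\ M[i-1, j-1] + \sigma(S_i, Q_j) \\ M[i, j-1] + \sigma(\_, Q_j) \end{cases}$$
• Each cell M[i, j] depends only on its three neighbours M[i−1, j−1], M[i−1, j]
and M[i, j−1].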
Example
M[m, n] is the similarity distance between sequences S and Q
S↓ \ Q→   _      A      T      G      TGA    ATGC   AT
_         0      1      2      4      8      16     32
A         1      0      1      2      3      4      5
T         2      1      0      1      2      3      4
G         4      2      1      0      1      2      3
GT        8      3      2      1      0.333  1.333  2.333
C         16     4      3      2      1.333  1.083  2.083
GGTT      32     5      4      3      2.333  1.833  1.833
TC        64     6      5      4      3.333  2.833  2.333
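The recurrence can be sketched as follows. This is a hedged illustration rather than the authors' code: the names `sigma` and `lz_distance` are hypothetical, N is assumed to be the number of symbols the two fragments share (counted with multiplicity via a multiset intersection), and the boundary row and column are filled from the summation formulas, so they may differ from the boundary values printed in the example table.

```python
from collections import Counter

GAP = 1.0  # sigma(S_i, _) = sigma(_, Q_j) = 1


def sigma(a: str, b: str) -> float:
    """Fragment score: 1 - N / max(len(a), len(b)); N is an assumed definition."""
    shared = sum((Counter(a) & Counter(b)).values())  # multiset intersection size
    return 1.0 - shared / max(len(a), len(b))


def lz_distance(s_frags: list[str], q_frags: list[str]) -> float:
    """Fill the matrix M over LZ fragments and return M[m][n]."""
    m, n = len(s_frags), len(q_frags)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: cumulative gap cost
        M[i][0] = M[i - 1][0] + GAP
    for j in range(1, n + 1):          # first row: cumulative gap cost
        M[0][j] = M[0][j - 1] + GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(
                M[i - 1][j] + GAP,                                     # gap in Q
                M[i - 1][j - 1] + sigma(s_frags[i - 1], q_frags[j - 1]),
                M[i][j - 1] + GAP,                                     # gap in S
            )
    return M[m][n]


S = ["A", "T", "G", "GT", "C", "GGTT", "TC"]
Q = ["A", "T", "G", "TGA", "ATGC", "AT"]
print(round(lz_distance(S, Q), 3))
```

Because the matrix is indexed by fragments rather than by individual symbols, its size is c(S) × c(Q), which is typically much smaller than the symbol-level matrix used by alignment-based dynamic programming.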
Results
• Data set: sequences of the first exon of the β-globin gene of 11 species
• Method:
  • Calculate the similarity degree among the sequences using the proposed
    method (LZ-complexity + dynamic programming)
  • Arrange all the similarity degrees into a matrix
  • Feed the pairwise distances to a neighbor-joining program in the PHYLIP
    package
G. Huang et al. (2D-graphical method)
• Method based on graphical representation
• Four vectors correspond to the four nucleotides:
𝐴 → (1, −
3
3)
𝑇 → (1,
3
2)
𝐺 → (1, − 5)
𝐶 → (1, 3)
G. Huang et al. (2D-graphical method)
• A DNA sequence can be turned into a graphical curve
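One way to picture the construction is a walk that adds one vector per nucleotide and records every intermediate point. The sketch below makes two loud assumptions: the curve is taken to be the cumulative sum of the nucleotide vectors, and the numbers in `PLACEHOLDER_VECTORS` are illustrative placeholders, not the vectors used by Huang et al.

```python
# Placeholder vectors for illustration only; NOT the values from the method.
PLACEHOLDER_VECTORS = {
    "A": (1.0, -1.0),
    "T": (1.0, 1.0),
    "G": (1.0, -2.0),
    "C": (1.0, 2.0),
}


def dna_curve(seq: str) -> list[tuple[float, float]]:
    """Return the points of the 2D curve, starting at the origin."""
    x, y = 0.0, 0.0
    points = [(x, y)]
    for base in seq:
        dx, dy = PLACEHOLDER_VECTORS[base]  # one step per nucleotide
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


print(dna_curve("ATGC"))
```

Since every vector has x-component 1, the curve advances one unit per nucleotide, so the x-coordinate of a point is its position in the sequence.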
G. Huang et al. (2D-graphical method)
• The graphs intuitively show the (dis)similarity between sequences.
G. Huang et al. (2D-graphical method)
• How to compare sequences?
• Similarity among sequences can be quantified by computing distance
between either vectors or points.
• Spatial distances
• Euclidean distance
• Mahalanobis distance
• Standard Euclidean distance
• Cosine similarity
• Stuart et al. (2002)
Euclidean distance
• Given two vectors $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$, the
Euclidean distance is computed as follows:
$$ED(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
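A direct transcription of the formula, as a minimal sketch (the function name `euclidean_distance` is hypothetical):

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared differences of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```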
Mahalanobis distance
• The Mahalanobis distance takes the covariance structure of the data into
account. It is defined as follows:
$$MD(A, B) = \sqrt{(A - B)\, CV^{-1}\, (A - B)'}$$
• $CV$ is the covariance matrix
Standard Euclidean distance
• The Standard Euclidean Distance (SED) rescales each of the n variables by its
variance only; unlike the Mahalanobis distance, it ignores the covariances
between variables.
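A minimal sketch of this idea, assuming the usual form in which each squared difference is divided by the variance of the corresponding variable. The function name and the `variances` argument are hypothetical; in practice the variances would be estimated from the whole data set.

```python
import math


def standardized_euclidean(a, b, variances):
    """Euclidean distance with each squared difference divided by a variance."""
    if not (len(a) == len(b) == len(variances)):
        raise ValueError("all inputs must have the same length")
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))


# With unit variances this reduces to the plain Euclidean distance.
print(standardized_euclidean([0.0, 0.0], [3.0, 4.0], [1.0, 1.0]))
```

This corresponds to a Mahalanobis distance whose covariance matrix is diagonal.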
Cosine similarity
• Stuart et al. define a distance using the angle between vectors. It is
defined as follows:
$$AD(A, B) = \frac{A \cdot B}{|A| \times |B|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\, \sqrt{\sum_{i=1}^{n} b_i^2}}$$
$$EAD(A, B) = -\ln\left(\frac{1 + AD(A, B)}{2}\right)$$
• where $AD(A, B)$ is the cosine similarity between $A$ and $B$, and $EAD(A, B)$
represents the evolutionary distance between $A$ and $B$.
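The two formulas translate directly into code. This is a minimal sketch with hypothetical function names; it assumes neither vector is the zero vector.

```python
import math


def cosine_similarity(a, b):
    """AD(A, B): dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def evolutionary_distance(a, b):
    """EAD(A, B) = -ln((1 + AD(A, B)) / 2)."""
    return -math.log((1.0 + cosine_similarity(a, b)) / 2.0)


print(evolutionary_distance([1.0, 0.0], [0.0, 1.0]))
```

Vectors pointing in the same direction give a distance close to 0, orthogonal vectors give ln 2, and the distance grows without bound as the vectors approach opposite directions.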
Results
• Two data sets have been used
  • a set of real sequences
    • Human mitochondrial genome
  • a set of random sequences
    • Obtained by applying random mutations to the real sequences
      (1%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% mutation
      rates)
• Euclidean, SED, Mahalanobis and EAD distances have been used
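A random set of this kind could be generated along the following lines. The mutation model is an assumption, since the slides do not describe it: each position is mutated independently with probability `rate`, and a mutation always substitutes a different nucleotide. The function name `mutate` is hypothetical.

```python
import random

BASES = "ACGT"


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute each base with probability `rate` (assumed mutation model)."""
    mutated = []
    for base in seq:
        if rng.random() < rate:
            # Pick uniformly among the three other nucleotides.
            mutated.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
original = "ATGGTCGGTTTC"
for rate in (0.01, 0.05, 0.10, 0.50, 1.00):
    print(rate, mutate(original, rate, rng))
```

Each mutated sequence would then be compared with the original using the four distances to see how the distance responds to the mutation rate.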
Results
• 𝑑 𝑥 denotes the distance
between a sequence and its
randomly mutated version.
• The Euclidean distance is more
sensitive to the mutation rate than
the other three distances.
Results
• 35 mitochondrial genome sequences from different mammals
(GenBank database)
• Primate species including human, ape, gorilla, chimpanzee, etc. are
grouped together
• The result is in agreement with those obtained by Yu et al. (2010) and Raina
et al. (2005)
Presented methods comparison
LZ-complexity based algorithm        2D-graphic based algorithm
Dynamic programming algorithm        Graphical algorithm
LZ-complexity measure                Various distances (ED, Mahalanobis, …)
Generic (DNA/proteins)               DNA-specific
Unrooted phylogenetic-tree results   Rooted phylogenetic-tree results

Editor's Notes

  • #33 PHYLIP is a free package of programs for inferring phylogenies