Biological sequences analysis

A review of two alignment-free methods for sequence comparison

Outline
• Introduction to the sequence alignment problem
• Introduction to alignment-free sequence comparison
• An LZ-complexity based comparison method
• A 2D graphical comparison method
• Overall comparison of the methods

Introduction to sequence alignment
• Goal: determine whether a particular sequence is similar to another sequence
• or whether a database contains a potentially homologous sequence.

• Two alignment types are used: global and local
• The global approach compares one whole sequence with other entire
sequences.
• The local method takes a subset of a sequence and attempts to align it to
subsets of other sequences.

• A global alignment compares the two sequences over their entire
length.
GCATTACTAATATATTAGTAAATCAGAGTAGTA
       |||||||||  ||
AAGCGAATAATATATTTATACTCAGATTATTGCGCG

• By contrast, a local alignment starts from a small seed (a short
exact match) that can be quickly extended into a longer alignment.
• The initial seed for the alignment:
TAT
|||

• And now the extended alignment:
TATATATTAGTA
||||||||| ||

• How to search for similarities in genetic sequences?
• Naive methods: comparing all possible alignments (extremely slow)
• Heuristic methods
  • Examples: BLAST, FASTA, …
  • Optimal solution is not guaranteed
  • Tradeoff: speed vs. accuracy
• Dynamic programming methods
  • Examples: Needleman & Wunsch, Smith & Waterman

• …faster alternatives?
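
As a point of reference for the dynamic programming methods named above, here is a minimal sketch of the Needleman & Wunsch global alignment score. The scoring values (match = 1, mismatch = -1, gap = -1) and the function name are assumptions chosen for illustration, not values taken from the slides. The two nested loops show why the cost grows with the product of the sequence lengths.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Scoring parameters are illustrative assumptions.
    Only two rows of the DP matrix are kept in memory.
    """
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] with the empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    s1 = "GCATTACTAATATATTAGTAAATCAGAGTAGTA"
    s2 = "AAGCGAATAATATATTTATACTCAGATTATTGCGCG"
    print(needleman_wunsch_score(s1, s2))
```

The running time is proportional to len(a) × len(b), which is the cost that alignment-free methods try to avoid.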

Alignment-free comparison
• Challenge: overcome the inefficiency of traditional alignment-based
algorithms
• Alignment-based methods
  • are slow
  • may produce incorrect results when used on more divergent but functionally
related sequences

Alignment-free comparison
• Much faster than alignment-based methods
• most methods work in linear time
• Four categories:
• methods based on k-mer/word frequency,
• methods based on substrings,
• methods based on information theory (LZ-complexity based method) and
• methods based on graphical representation (2D-graphical method)
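
To make the first category concrete, the following is a minimal sketch of a k-mer (word) frequency comparison. It is not one of the two methods reviewed here. The function names, the default k = 3 and the use of the Euclidean distance between count vectors are assumptions made for illustration.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq (one linear pass)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Euclidean distance between the k-mer count vectors of a and b."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    words = set(ca) | set(cb)
    # Counter returns 0 for words that are missing from one sequence.
    return math.sqrt(sum((ca[w] - cb[w]) ** 2 for w in words))


if __name__ == "__main__":
    print(kmer_distance("GCATTACTAATATATTAGTA", "AAGCGAATAATATATTTATA"))
```

Each sequence is scanned once, so no alignment matrix is ever built.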

LZ-complexity based sequence comparison
• Method based on information theory
• Analysis of DNA/protein sequences
• Built upon the LZ-complexity measure
• Dynamic programming algorithm

LZ-complexity
• Complexity measure for finite sequences
• LZ-complexity as entropy rate estimator for finite sequences
• Produces a dictionary of productions for a sequence 𝑆.
• “The proposed complexity measure is related to the number of steps
in a self-delimiting production process by which a given sequence is
presumed to be generated” (Abraham Lempel and Jacob Ziv, “On the Complexity of
Finite Sequences”, 1976)

LZ-complexity (production process)
• $m$-step production process of a finite sequence $S$
$H(S) = S(1, h_1) \cdot S(h_1 + 1, h_2) \cdots S(h_{m-1} + 1, h_m)$
• $H(S)$ is called the history of $S$, and $H_i(S) = S(h_{i-1} + 1, h_i)$ is called the
$i$-th component of $H(S)$.
• Each component $H_i(S)$ is added to a dictionary

LZ-complexity (algorithm)
Initialize the dictionary
Repeat until the sequence has been consumed:
• Add the next symbol to the current subsequence.
• If the subsequence is not reproducible from the previous history, add it to the
dictionary, increase the index value and start a new subsequence.
When the sequence ends, add any remaining subsequence to the dictionary.

The production process inserts a comma (',') into a sequence 𝑆 after the
creation of each new phrase formed by the concatenation of the longest
recognized dictionary phrase and the innovative symbol that follows.
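
The production process can be sketched in a few lines of Python. This is a minimal sketch, assuming that "reproducible" means the current subsequence already occurs as a substring of everything read so far minus its last symbol. The function name `lz_history` is hypothetical. Under that assumption the sketch reproduces the history of the example sequence S = ATGGTCGGTTTC.

```python
def lz_history(seq):
    """Split seq into the components of its LZ production history.

    Assumption: a candidate component seq[start:end] is 'reproducible'
    when it occurs in seq[:end - 1], i.e. in the text read so far
    without the candidate's last symbol.
    """
    components = []
    start = 0
    n = len(seq)
    while start < n:
        end = start + 1
        # Keep extending while the candidate can be copied from the past.
        while end <= n and seq[start:end] in seq[:end - 1]:
            end += 1
        # The candidate now ends with an innovative symbol,
        # or the sequence ran out (last component).
        end = min(end, n)
        components.append(seq[start:end])
        start = end
    return components


if __name__ == "__main__":
    history = lz_history("ATGGTCGGTTTC")
    print(history, len(history))
```

The LZ-complexity is the number of components, `len(lz_history(seq))`. The substring search makes this sketch quadratic in the worst case; it favours readability over speed.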

LZ-complexity (Example)
• S = ATGGTCGGTTTC
Position  Symbol  Add to dictionary  Index
1         A       A                  1
2         T       T                  2
3         G       G                  3
4         G
5         T       GT                 4
6         C       C                  5
7         G
8         G
9         T
10        T       GGTT               6
11        T
12        C       TC                 7

• The complexity $c(S)$ of the sequence $S$ is $c(S) = 7$
• The history of $S$ is $H(S) = \{A, T, G, GT, C, GGTT, TC\}$

• Based on the number of components in the LZ-complexity
decomposition of the DNA sequences.
• Given two sequences $S$ and $Q$ decomposed using the LZ-complexity:
$S = S_1 S_2 \ldots S_k \ldots S_m$
$Q = Q_1 Q_2 \ldots Q_k \ldots Q_n$
• $m$ is the number of fragments of $S$
• $n$ is the number of fragments of $Q$

• Let $\sigma$ be the score function used to build the dynamic programming
matrix. It is defined as follows:
$\sigma(S_i, \_) = \sigma(\_, Q_j) = 1$
$\sigma(S_i, Q_j) = 1 - \dfrac{N(S_i, Q_j)}{\max(\mathrm{length}(S_i), \mathrm{length}(Q_j))}$
• where $N(S_i, Q_j)$ is the number of elements that the fragments $S_i$
and $Q_j$ have in common.
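
A short worked example of the score function, using fragments that appear in the example matrix. It assumes that $N$ counts shared symbols with multiplicity, a reading the slides do not spell out: GT and TGA share G and T, while C and ATGC share only C.

```latex
\sigma(GT, TGA) = 1 - \frac{2}{\max(2, 3)} = 1 - \frac{2}{3} \approx 0.333
\qquad
\sigma(C, ATGC) = 1 - \frac{1}{\max(1, 4)} = 1 - \frac{1}{4} = 0.75
```

Identical fragments score 0, and fragments with no symbol in common score 1, the same cost as a gap.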

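• The sequence similarity matrix $M$ is built using the following
formulas:
$M[i, 0] = \sum_{k=1}^{i} \sigma(S_k, \_)$
$M[0, j] = \sum_{k=1}^{j} \sigma(\_, Q_k)$
$M[i, j] = \min \{\, M[i-1, j] + \sigma(S_i, \_),\; M[i-1, j-1] + \sigma(S_i, Q_j),\; M[i, j-1] + \sigma(\_, Q_j) \,\}$
• Each entry $M[i, j]$ depends only on its three neighbours $M[i-1, j-1]$, $M[i-1, j]$
and $M[i, j-1]$.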

Example
$M[m, n]$ is the similarity distance between sequences $S$ and $Q$
Q→               A    T    G    TGA    ATGC   AT
S↓        0      1    2    4    8      16     32
A         1      0    1    2    3      4      5
T         2      1    0    1    2      3      4
G         4      2    1    0    1      2      3
GT        8      3    2    1    0.333  1.333  2.333
C         16     4    3    2    1.333  1.083  2.083
GGTT      32     5    4    3    2.333  1.833  1.833
TC        64     6    5    4    3.333  2.833  2.333
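
The score function and the recurrence can be combined into a short dynamic programming sketch. The function names are hypothetical, and the code assumes that the number of common elements is the multiset intersection of the two fragments. With the fragments of the example, this reading gives a final entry of about 2.333. The sketch fills the first row and column from the summation formulas (a cumulative gap cost of 1 per fragment), so its boundary values can differ from those printed in the example matrix; the comparison is on the interior entries.

```python
from collections import Counter

GAP_COST = 1.0  # sigma(S_i, _) = sigma(_, Q_j) = 1


def sigma(s_frag, q_frag):
    """Score of matching two fragments.

    Assumption: N is the number of symbols shared with multiplicity.
    """
    shared = sum((Counter(s_frag) & Counter(q_frag)).values())
    return 1 - shared / max(len(s_frag), len(q_frag))


def similarity_matrix(s_frags, q_frags):
    """Build the (m+1) x (n+1) matrix M; M[m][n] is the distance."""
    m, n = len(s_frags), len(q_frags)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + GAP_COST
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + GAP_COST
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = min(
                M[i - 1][j] + GAP_COST,
                M[i - 1][j - 1] + sigma(s_frags[i - 1], q_frags[j - 1]),
                M[i][j - 1] + GAP_COST,
            )
    return M


if __name__ == "__main__":
    S = ["A", "T", "G", "GT", "C", "GGTT", "TC"]
    Q = ["A", "T", "G", "TGA", "ATGC", "AT"]
    M = similarity_matrix(S, Q)
    print(round(M[len(S)][len(Q)], 3))
```

Because the matrix is indexed by fragments rather than by single symbols, its size is m × n (here 7 × 6), not the product of the sequence lengths.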

Results
• Data set: sequences of the first exon of the β-globin gene of 11 species
• Method:
  • Calculate the similarity degree between the sequences using the proposed
method (LZ-complexity + dynamic programming)
  • Arrange all the similarity degrees into a matrix
  • Feed the pairwise distances into the neighbor-joining program of the PHYLIP
package

G. Huang et al. (2D-graphical method)
• Method based on graphical representation
• Four vectors correspond to the four nucleotides:
𝐴 → (1, −
3
3)
𝑇 → (1,
3
2)
𝐺 → (1, − 5)
𝐶 → (1, 3)

• A DNA sequence can be turned into a graphical curve

• The graphs show the (dis)similarity between sequences intuitively.

• How to compare sequences?
• Similarity among sequences can be quantified by computing the distance
between either vectors or points.
• Spatial distances
• Euclidean distance
• Mahalanobis distance
• Standard Euclidean distance
• Cosine similarity
• Stuart et al. (2002)

Euclidean distance
• Given two vectors $A = \{a_1, a_2, \ldots, a_n\}$ and $B = \{b_1, b_2, \ldots, b_n\}$, the
Euclidean distance is computed as follows:
$ED(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$

Mahalanobis distance
• The Mahalanobis distance takes the covariance of the data into
account. It is defined as follows:
$MD(A, B) = \sqrt{(A - B)\, CV^{-1}\, (A - B)'}$
• $CV$ is the covariance matrix

Standard Euclidean distance
• The Standard Euclidean Distance (SED) takes only the variances of the $n$
variables into account, not their covariances.

Cosine similarity
• Stuart et al. define a distance using the angle between vectors. It is
defined as follows:
$AD(A, B) = \dfrac{A \cdot B}{|A| \times |B|} = \dfrac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\, \sqrt{\sum_{i=1}^{n} b_i^2}}$
$EAD(A, B) = -\ln \dfrac{1 + AD(A, B)}{2}$
• where $AD(A, B)$ is the cosine similarity between $A$ and $B$, and $EAD(A, B)$
represents the evolutionary distance between $A$ and $B$.
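
The distances above can be sketched with the Python standard library alone. The function names are hypothetical. The SED formula used here, each squared difference divided by the variance of that variable, is the usual standardized Euclidean distance and is an assumption, since the slides give no formula for it. The Mahalanobis distance is left out because it needs a matrix inverse.

```python
import math


def euclidean(a, b):
    """ED(A, B): square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardized_euclidean(a, b, variances):
    """SED (assumed form): each squared difference is divided by the
    variance of that variable, supplied by the caller."""
    return math.sqrt(
        sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances))
    )


def cosine_similarity(a, b):
    """AD(A, B): dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def evolutionary_distance(a, b):
    """EAD(A, B) = -ln((1 + AD(A, B)) / 2)."""
    return -math.log((1 + cosine_similarity(a, b)) / 2)


if __name__ == "__main__":
    u, v = [1.0, 0.0], [0.0, 1.0]
    print(euclidean(u, v), cosine_similarity(u, v), evolutionary_distance(u, v))
```

Vectors pointing in the same direction give AD = 1 and EAD = 0. Orthogonal vectors give AD = 0 and EAD = ln 2.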

Results
• Two data sets have been used
  • a set of real sequences
    • Human mitochondrial genome
  • a set of random sequences
    • Obtained by applying random mutations to the real sequences
(1%, 5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100% mutation
rates)
• Euclidean, SED, Mahalanobis and EAD distances have been used
Results
• 𝑑 𝑥 denotes the distance between a sequence and its randomly mutated
version.
• The Euclidean distance is more sensitive to the mutation rate than
the other three distances.

Results
• 35 mitochondrial genome sequences from different mammals
(GenBank database)
• Primate species, including human, ape, gorilla, chimpanzee, etc., are
grouped together
• The result agrees with those obtained by Yu et al. (2010) and Raina
et al. (2005)

Presented methods comparison
LZ-complexity based algorithm        | 2D-graphic based algorithm
-------------------------------------+---------------------------------------
Dynamic programming algorithm        | Graphical algorithm
LZ-complexity measure                | Various distances (ED, Mahalanobis, …)
Generic (DNA/proteins)               | DNA-specific
Unrooted phylogenetic-tree results   | Rooted phylogenetic-tree results

[Slide illustration: excerpt of the LZ-complexity production table with an added Rate column (e.g. GT at position 5, index 4, rate 0.80), shown next to the Euclidean distance formula $ED(A, B)$]
