We describe how the BWT works and how it can be useful for compressing DNA reads. In the second part, we talk about indexing and querying data transformed with the BWT.
6. DNA
A T
G C
Your genome is sort of like a book of recipes, with a
separate recipe for each type of molecule in your
body
DNA is the kind of molecule that encodes your genome, the sum of all your
genetic information, all your genes
7. DNA
These letters stand for different kinds of molecules, different bases.
A for adenine, C for cytosine, G for guanine, and T for thymine
A DNA molecule is shaped like a double helix,
this thing that looks like a twisted ladder.
And the rungs of this ladder are made up of pairs of
bases
We can take the DNA molecule and
turn it into a sequence of letters, a string
10. DNA Sequencer Machine
A DNA sequencer is a scientific
instrument used to automate the
DNA sequencing process. Given a
sample of DNA, a DNA sequencer
is used to determine the order of
the four bases:
G (guanine), C (cytosine), A
(adenine) and T (thymine). This is
then reported as a text string,
called a read
21. Burrows-Wheeler transform
T = a b a a b a $
Reversible permutation of the characters of a string, originally used for compression
All rotations of T, sorted (the BWT matrix, BWM):
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
BWT(T) = last column of the sorted matrix = a b b a $ a a
How is it useful for compression?
How is it reversible?
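The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not an efficient implementation: it materializes every rotation, and the function names are our own. It assumes T already ends with `$` and that `$` sorts before every other character, which holds for Python's default string ordering with lowercase letters.

```python
def rotations(t):
    """Return all rotations of t, one per starting offset."""
    return [t[i:] + t[:i] for i in range(len(t))]

def bwm(t):
    """Burrows-Wheeler matrix: the rotations of t in sorted order."""
    return sorted(rotations(t))

def bwt_via_bwm(t):
    """BWT(T) is the last column of the sorted matrix."""
    return ''.join(row[-1] for row in bwm(t))

print(bwt_via_bwm('abaaba$'))  # abba$aa
```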
23. Burrows-Wheeler transform
BWT bears a resemblance to suffix array
BWM(T)           SA(T)
$ a b a a b a    6
a $ a b a a b    5
a a b a $ a b    2
a b a $ a b a    3
a b a a b a $    0
b a $ a b a a    4
b a a b a $ a    1
Sort order is the same whether rows are rotations or suffixes
24. Burrows-Wheeler transform
In fact, this gives us a new definition/way to construct BWT(T):
BWM(T)           Suffix           SA(T)
$ a b a a b a    $                6
a $ a b a a b    a $              5
a a b a $ a b    a a b a $        2
a b a $ a b a    a b a $          3
a b a a b a $    a b a a b a $    0
b a $ a b a a    b a $            4
b a a b a $ a    b a a b a $      1
“BWT = characters just to the left of the suffixes in the suffix array”
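The suffix-array definition translates directly into code. The sketch below builds the suffix array naively (sorting whole suffixes, which is fine for a toy string but far too slow for a genome) and then reads off the character to the left of each suffix. The function names are illustrative.

```python
def suffix_array(t):
    """Naive suffix array: sort suffix offsets by the suffix they start."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def bwt_via_sa(t):
    """BWT[i] is the character just left of suffix SA[i].

    For SA[i] == 0 the index -1 wraps around to the final '$'.
    """
    return ''.join(t[i - 1] for i in suffix_array(t))

print(suffix_array('abaaba$'))  # [6, 5, 2, 3, 0, 4, 1]
print(bwt_via_sa('abaaba$'))    # abba$aa
```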
25. Burrows-Wheeler transform
a b a a b a $
How to reverse the BWT?
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
Sort BWT matrix
a b b a $ a a
T BWT(T)
?
BWM has a key property called
LF mapping
26. Burrows-Wheeler transform: T-ranking
a0 b0 a1 a2 b1 a3 $
Give each character in T a rank, equal to the number of times that character occurred
previously in T. Call this the T-ranking.
Ranks aren’t explicitly stored; they are just for illustration
Now, let's rewrite BWM including ranks...
28. Burrows-Wheeler transform
BWM with ranking
$ a0 b0 a1 a2 b1 a3
a3 $ a0 b0 a1 a2 b1
a1 a2 b1 a3 $ a0 b0
a2 b1 a3 $ a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $ a0 b0 a1 a2
b0 a1 a2 b1 a3 $ a0
F L
Look at the first and the last columns, called
F and L
29. Burrows-Wheeler transform
BWM with ranking
$ a0 b0 a1 a2 b1 a3
a3 $ a0 b0 a1 a2 b1
a1 a2 b1 a3 $ a0 b0
a2 b1 a3 $ a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $ a0 b0 a1 a2
b0 a1 a2 b1 a3 $ a0
F L
Look at the first and the last columns, called
F and L
Now look at just the as:
the as occur in the same order in F and in L
31. Burrows-Wheeler transform
BWM with ranking:
$ a0 b0 a1 a2 b1 a3
a3 $ a0 b0 a1 a2 b1
a1 a2 b1 a3 $ a0 b0
a2 b1 a3 $ a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $ a0 b0 a1 a2
b0 a1 a2 b1 a3 $ a0
F L
LF mapping: the i-th occurrence of a character c in L and the i-th
occurrence of c in F correspond to
the same occurrence in T (i.e. they have the same rank)
However we rank occurrences of c, the ranks appear
in the same order in F and L
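To see the LF property on the running example, the following sketch attaches T-ranks to every character, builds F and L, and prints the order in which the ranks of each character appear in the two columns. The helper names are ours, and the rotations are sorted naively for illustration only.

```python
def t_ranked(t):
    """Pair each character of t with its T-rank (number of earlier occurrences)."""
    seen = {}
    out = []
    for c in t:
        out.append((c, seen.get(c, 0)))
        seen[c] = seen.get(c, 0) + 1
    return out

def f_and_l(t):
    """First and last columns of the BWM, keeping T-ranks attached."""
    ranked = t_ranked(t)
    # Sort rotation start offsets by the rotation they begin.
    order = sorted(range(len(t)), key=lambda i: t[i:] + t[:i])
    f = [ranked[i] for i in order]
    l = [ranked[i - 1] for i in order]   # character just left of each rotation
    return f, l

f, l = f_and_l('abaaba$')
for c in 'ab':
    ranks_f = [r for ch, r in f if ch == c]
    ranks_l = [r for ch, r in l if ch == c]
    print(c, ranks_f, ranks_l)
# a [3, 1, 2, 0] [3, 1, 2, 0]
# b [1, 0] [1, 0]
```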
35. Burrows-Wheeler transform
Say T has 300 As, 400 Cs, 250 Gs and 700 Ts, and $ < A < C < G < T
Which BWM row (0-based) begins with G100? (Ranks are B-ranks: occurrences are numbered in the order they appear in F)
✘ Skip the row starting with $ (1 row)
✘ Skip the rows starting with A (300 rows)
✘ Skip the rows starting with C (400 rows)
✘ Skip the first 100 rows starting with G (100 rows)
Answer: 1 + 300 + 400 + 100 = row 801
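The same arithmetic as a small function. It assumes B-ranks are 0-based and that characters are ordered by Python's default string order, in which `$` sorts before the uppercase letters. The name `first_row_of` is hypothetical.

```python
def first_row_of(counts, c, b_rank):
    """0-based BWM row whose F entry is occurrence number b_rank (0-based) of c.

    counts maps each character to its number of occurrences in T.
    """
    # Every row starting with a smaller character comes first in F.
    skipped = sum(n for ch, n in counts.items() if ch < c)
    return skipped + b_rank

counts = {'$': 1, 'A': 300, 'C': 400, 'G': 250, 'T': 700}
print(first_row_of(counts, 'G', 100))  # 801
```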
36. Burrows-Wheeler transform
F    L
$    a0
a0   b0
a1   b1
a2   a1
a3   $
b0   a2
b1   a3
Reverse BWT(T) starting at right-hand-side of T and moving left
Start in the first row. F must have $. L contains the character just prior to $: a0
a0: LF mapping says this is the same occurrence of a as the first a in F.
Jump to the row beginning with a0.
L contains character just prior to a0: b0
Repeat for b0, get a2
Repeat for a2, get a1
Repeat for a1, get b1
Repeat for b1, get a3
Repeat for a3, get $, done
Reverse of characters we visited =
a3 b1 a1 a2 b0 a0 $ = T
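The walk above can be written as a short loop. This is a sketch under the same assumptions as the slides: B-ranks are computed on the fly from BWT(T), F is summarized by the first row of each character, and `$` is the smallest character. Function names are illustrative.

```python
def rank_bwt(bw):
    """B-rank of every character of bw, plus the total count per character."""
    tots = {}
    ranks = []
    for c in bw:
        ranks.append(tots.get(c, 0))
        tots[c] = tots.get(c, 0) + 1
    return ranks, tots

def first_col(tots):
    """Map each character to the first row of F that starts with it."""
    first = {}
    row = 0
    for c in sorted(tots):
        first[c] = row
        row += tots[c]
    return first

def reverse_bwt(bw):
    """Rebuild T from BWT(T) by repeated LF mapping, right to left."""
    ranks, tots = rank_bwt(bw)
    first = first_col(tots)
    row = 0                  # row 0 is the rotation starting with '$'
    t = '$'
    while bw[row] != '$':
        c = bw[row]
        t = c + t                      # prepend the character left of the current one
        row = first[c] + ranks[row]    # LF mapping: same occurrence, now in F
    return t

print(reverse_bwt('abba$aa'))  # abaaba$
```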
37. Burrows-Wheeler transform
1. We’ve seen how BWT is useful for compression:
a. Sorts characters by right-context, making a more compressible string
2. And how it’s reversible:
a. Repeated applications of LF Mapping, recreating T from right to left
How is it used to index?
39. FM-index
FM-index: an index combining the BWT with a few small auxiliary data structures
Core of index consists of F and L in BWM:
➔ F can be represented very simply (1 integer per alphabet character)
➔ L is compressible
➔ Potentially very space economical!
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
F L
The columns between F and L are not stored in the index
40. FM-index: querying
Though BWM is related to suffix array, we can’t query it the same way
$
a $
a a b a $
a b a $
a b a a b a $
b a $
b a a b a $
6
5
2
3
0
4
1
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
We don’t have these columns. Binary search is not possible
41. FM-index: querying
Look for the range of rows of BWM(T) with P as prefix
Do this for P’s shortest suffix, then extend to successively longer suffixes until
the range becomes empty or we have exhausted P
F L
P = aba
Easy to find all rows beginning
with a, thanks to F’s simple
structure
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
42. FM-index: querying
We have the rows beginning with a, now we seek rows beginning with ba
F L
P = aba
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Look at those
rows in L.
b₀, b₁ are bs
appearing just to
left.
F L
P = aba
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Use LF Mapping. Let new
range delimit those bs.
43. FM-index: querying
We have the rows beginning with ba, now we seek rows beginning with aba
F L
P = aba
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
F L
P = aba
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
a₂ a₃ occur just
to left.
Use LF Mapping
44. FM-index: querying
Now we have the same range, [3, 5[,
that we would have got from querying the suffix array
P = aba
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Where are these?
[3, 5[
$
a $
a a b a $
a b a $
a b a a b a $
b a $
b a a b a $
6
5
2
3
0
4
1
Unlike the suffix array, we do not immediately
know where the matches are in T...
[3, 5[
Note: when P does not occur in T,
the search fails to find the next character of P in L within the current range, and the range becomes empty
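The query procedure of the last few slides can be sketched as follows. This version deliberately finds the preceding characters by counting directly in L (the slow scan, with no tally or checkpoints), and it returns a half-open row range, with (0, 0) standing for "no match". The function names and the return convention are our own choices.

```python
def bwt(t):
    """BWT via a naive suffix array (illustration only)."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return ''.join(t[i - 1] for i in sa)

def backward_search(bw, p):
    """Return the half-open range [lo, hi) of BWM rows that have p as a prefix."""
    # F is fully described by the first row of each character.
    first, row = {}, 0
    for c in sorted(set(bw)):
        first[c] = row
        row += bw.count(c)
    lo, hi = 0, len(bw)              # the empty suffix matches every row
    for c in reversed(p):            # shortest suffix of p first
        if c not in first:
            return (0, 0)
        before = bw[:lo].count(c)    # occurrences of c in L above the range
        inside = bw[lo:hi].count(c)  # occurrences of c in L within the range
        if inside == 0:
            return (0, 0)            # next character not found: p is absent
        lo = first[c] + before       # LF mapping of the first such occurrence
        hi = lo + inside
    return (lo, hi)

bw = bwt('abaaba$')
print(backward_search(bw, 'aba'))  # (3, 5)
print(backward_search(bw, 'bb'))   # (0, 0)
```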
45. FM-index: querying
If we scan characters in the last column, that can be very slow, O(m)
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Scan, looking for bs
P = aba
46. FM-index: issues
1. Scanning for preceding characters is slow
2. Storing ranks takes too much space
3. Need way to find where matches occur in T
47. FM-index: fast rank calculations
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Is there an O(1) way to
determine which bs precede
the as in our range?
Idea: pre-calculate number
of as, bs in L up to every row
F L a b
$ a 1 0
a b 1 1
a b 1 2
a a 2 2
a $ 2 2
b a 3 2
b a 4 2
Tally
Lookup takes O(1) time, but the tally requires
m × |Σ| integers
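A minimal sketch of the full tally table for the running example. It stores, for every character other than `$`, the number of occurrences in L up to and including each row, which is the m × |Σ| table discussed here. The function name is illustrative.

```python
def tally_table(bw):
    """For each character, a list whose entry i counts its occurrences in bw[0..i]."""
    tally = {c: [] for c in set(bw) if c != '$'}
    running = {c: 0 for c in tally}
    for ch in bw:
        if ch in running:
            running[ch] += 1
        for c in tally:
            tally[c].append(running[c])
    return tally

tally = tally_table('abba$aa')
print(tally['a'])  # [1, 1, 1, 2, 2, 3, 4]
print(tally['b'])  # [0, 1, 2, 2, 2, 2, 2]
```

With this table, the rank of the character at row i of L is `tally[c][i] - 1`, a single lookup.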
48. FM-index: fast rank calculations
Another idea: pre-calculate the number of as and bs up to
only some rows, e.g. every 5th row.
Call the pre-calculated rows checkpoints
F L a b
$ a 1 0
a b
a b
a a
a $
b a 3 2
b a
Tally
Lookup here
succeeds as
usual
Ooops, not a
checkpoint here
But here there’s
one nearby
To resolve a lookup for a character c in a
non-checkpoint row, scan along L until we
get to the nearest checkpoint. Use the tally at the
checkpoint, adjusted for the number of cs we saw
along the way
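A sketch of checkpointed tallies, assuming a checkpoint at every `step`-th row starting from row 0 (as in the example, with `step=5`). For simplicity this version always walks upward to the previous checkpoint rather than choosing the nearest one in either direction; the class and method names are hypothetical.

```python
class Checkpoints:
    """Tallies stored only at every `step`-th row of L."""

    def __init__(self, bw, step=5):
        self.bw = bw
        self.step = step
        running = {c: 0 for c in set(bw)}
        self.cps = {c: [] for c in running}   # char -> tallies at checkpoint rows
        for i, ch in enumerate(bw):
            running[ch] += 1
            if i % step == 0:
                for c in running:
                    self.cps[c].append(running[c])

    def count(self, c, row):
        """Number of occurrences of c in bw[0..row], inclusive."""
        seen = 0
        # Walk up to the previous checkpoint, counting cs passed on the way.
        while row % self.step != 0:
            if self.bw[row] == c:
                seen += 1
            row -= 1
        return self.cps[c][row // self.step] + seen

cp = Checkpoints('abba$aa', step=5)
print(cp.cps['a'], cp.cps['b'])  # [1, 3] [0, 2]
print(cp.count('b', 2))          # 2
print(cp.count('a', 6))          # 4
```

Each lookup scans at most `step - 1` rows, so with a constant `step` the cost stays O(1) while only about m / `step` rows carry tallies.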
49. FM-index: fast rank calculations
Another example
L a b
... ... ...
b 234 222
b
a
b
b
a
b
a
a 238 226
... ... ...
Tally
What is my rank? (the b three rows below the first checkpoint)
222 + 2 - 1 = 223
What is my rank? (the a one row above the second checkpoint)
238 - 1 - 1 = 236
Assuming checkpoints are
spaced O(1) distance apart,
lookups are O(1)
50. FM-index: a few problems
SOLVED! At the expense of adding checkpoints to index -> O(m) integers
● With checkpoints scan takes time O(1)
● With checkpoints we greatly reduce number of integers needed for ranks
○ But it is still O(m) space. There’s a literature to improve this space bound
NOT YET RESOLVED: need a way to find where these occurrences are in T
51. FM-index & Reads Compression
A more space-efficient search for data occurrences & a compression idea for reads
> D’Avino Ferdinando
52. Find Occurrences in T
Need a way to find where some occurrences are in T
A Naive Method : STORE SUFFIX ARRAY AND LOOK UP
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
$
a $
a a b a $
a b a $
a b a a b a $
b a $
b a a b a $
6
5
2
3
0
4
1
Offset : 0, 3
str : aba
53. Find Occurrences in T
Another Idea: STORE SOME ENTRIES AND LOOK UP USING FM-INDEX
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
$
a a b a $
a b a a b a $
b a $
6
x
2
x
0
4
x
str : aba
aba INDEX: ? (this row's SA entry was not stored)
aaba INDEX: 2 (one LF step to the left reaches a stored entry)
aba INDEX: 2 + 1 = 3
IMPORTANT: The SA entries to store are chosen so that their offsets in T are a
constant distance from each other. In our example this distance is 2.
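The lookup with a sampled suffix array can be sketched like this. It assumes SA entries are kept when their text offset is a multiple of `spacing` (2 in the example), and it computes the LF mapping by direct counting, which is slow but keeps the example short. All names are illustrative.

```python
def build(t, spacing=2):
    """Return BWT(T) and a sparse SA: {row: offset} for offsets divisible by spacing."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bw = ''.join(t[i - 1] for i in sa)
    sample = {row: off for row, off in enumerate(sa) if off % spacing == 0}
    return bw, sample

def lf(bw, row):
    """LF mapping by direct counting (slow, for illustration only)."""
    c = bw[row]
    smaller = sum(1 for ch in bw if ch < c)   # rows of F before the first c
    rank = bw[:row].count(c)                  # B-rank of this occurrence
    return smaller + rank

def resolve(bw, sample, row):
    """Text offset of the suffix at `row`: step left until a sampled row is hit."""
    steps = 0
    while row not in sample:
        row = lf(bw, row)
        steps += 1
    return sample[row] + steps

bw, sample = build('abaaba$')
print(sample)                                               # {0: 6, 2: 2, 4: 0, 5: 4}
print(sorted(resolve(bw, sample, r) for r in range(3, 5)))  # [0, 3]
```

Because offset 0 is always sampled, the walk to the left reaches a stored entry within `spacing - 1` steps.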
54. FM Index: Small memory footprint
Components of the FM Index
First Column (F) : ~ |Σ| integers
Last Column (L) : m characters
SA sample: (m ・ a) integers, where a is the fraction of rows kept
Checkpoints: (m ・ |Σ| ・ b) integers, where b is the fraction of rows checkpointed
55. FM Index: Small memory footprint
DNA alphabet (2 bits per nucleotide), T = human genome, a = 1/32, b = 1/128
First Column (F) : 4 bytes ・ 4 = 16 bytes
Last Column (L) : 2 bits ・ 3 billion chars = 750 MB
SA sample: (3 billion chars ・ 4 bytes)/32 = ~ 400 MB
Checkpoints: (3 billion chars ・ 4 bytes)/128 = ~ 100 MB
Total < 1.5 GB
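The estimate can be reproduced with a few lines of arithmetic. The sketch follows the figures exactly as written on this slide, with 1 MB taken as 10^6 bytes. Note that the checkpoint line charges 4 bytes per checkpointed row; storing one 4-byte tally per alphabet character, as in the component list (m ・ |Σ| ・ b), would multiply that term by |Σ| = 4.

```python
m = 3_000_000_000      # characters in T (human genome, as on the slide)
sigma = 4              # |Σ| for DNA
int_bytes = 4          # bytes per stored integer
sa_every = 32          # a = 1/32: one SA entry kept per 32 rows
cp_every = 128         # b = 1/128: one checkpoint per 128 rows

f_bytes = sigma * int_bytes             # first column
l_bytes = m * 2 // 8                    # last column at 2 bits per character
sa_bytes = m * int_bytes // sa_every    # SA sample
cp_bytes = m * int_bytes // cp_every    # checkpoints, as written on the slide

MB = 1_000_000
total = f_bytes + l_bytes + sa_bytes + cp_bytes
print(l_bytes / MB, sa_bytes / MB, cp_bytes / MB)  # 750.0 375.0 93.75
print(total / MB)                                  # 1218.750016
```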
56. Compression of BWT Strings
Lots of possible compression schemes will benefit from preprocessing with BWT,
because it tends to group runs of same letters together
Ferragina & Manzini scheme
57. Move-to-Front Transform
The main idea is to replace every symbol with its index in the stack
of “recently used symbols”
A long run of identical symbols becomes a run of 0s after the first symbol of the run
58. Move-to-Front Transform
Σ = {A, C, G, T}, input: ...GCGACCT...
Δ = {0, 1, 2, 3}
Symbol   Output so far    List before this symbol
G        2                A C G T
C        2,2              G A C T
G        2,2,1            C G A T
A        2,2,1,2          G C A T
C        2,2,1,2,2        A G C T
C        2,2,1,2,2,0      C A G T
T        2,2,1,2,2,0,3    C A G T
FINAL    2,2,1,2,2,0,3 = MTF(GCGACCT); list after the last symbol: T C A G
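The worked example can be checked with a short sketch of the transform and its inverse. The initial list is assumed to be the alphabet in the order A, C, G, T, as on the slide; function names are our own.

```python
def mtf_encode(s, alphabet):
    """Replace each symbol by its index in the recently-used list, then move it to the front."""
    recent = list(alphabet)
    out = []
    for c in s:
        i = recent.index(c)
        out.append(i)
        recent.insert(0, recent.pop(i))
    return out

def mtf_decode(codes, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    recent = list(alphabet)
    out = []
    for i in codes:
        c = recent.pop(i)
        out.append(c)
        recent.insert(0, c)
    return ''.join(out)

codes = mtf_encode('GCGACCT', 'ACGT')
print(codes)                      # [2, 2, 1, 2, 2, 0, 3]
print(mtf_decode(codes, 'ACGT'))  # GCGACCT
```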
59. RLE Operation & Prefix Code
The RLE operation simply consists of replacing long sequences of 0s
with an integer representing the length of the sequence
The output at this point can then be compressed by a statistical coder,
such as Huffman coding (a prefix code) or arithmetic coding
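A minimal sketch of run-length encoding restricted to 0s. The output format is an assumption made for illustration: each run of 0s becomes a `('Z', length)` pair and every other code passes through unchanged. The slides do not specify how run lengths are written, and a real scheme would need an encoding that the final coder can consume.

```python
def rle_zeros(codes):
    """Collapse each run of 0s into a ('Z', run_length) pair; other codes pass through."""
    out = []
    i = 0
    while i < len(codes):
        if codes[i] == 0:
            j = i
            while j < len(codes) and codes[j] == 0:
                j += 1                 # extend the run of 0s
            out.append(('Z', j - i))
            i = j
        else:
            out.append(codes[i])
            i += 1
    return out

print(rle_zeros([2, 0, 0, 0, 0, 1, 0, 3]))  # [2, ('Z', 4), 1, ('Z', 1), 3]
```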
61. Possible future developments
Try to extend the RLE operation to replace all long runs of the same character
Try other algorithms for the final phase
62. THANKS!
Any questions?
You can find us at
✘ f.davino10@studenti.unisa.it
✘ l.sarto1@studenti.unisa.it
✘ s.shevchenko@studenti.unisa.it