Burrows-Wheeler transform for terabases

Based on work of
Jouni Sirén
Wellcome Trust Sanger Institute

HELLO!
We are ...
D’Avino Ferdinando
Sarto Lino
Shevchenko Sergiy

Contents
➔ DNA, reads and strings
➔ Burrows Wheeler transform
➔ FM-index
➔ FM-index & Reads Compression

Introduction
DNA, reads and strings
> D’Avino Ferdinando

DNA
A T
G C
Your genome is sort of like a book of recipes, with a
separate recipe for each type of molecule in your
body
DNA is the kind of molecule that encodes your genome: the sum of all your
genetic information, all your genes

DNA
These letters stand for different kinds of molecules, different bases.
A for adenine, C for cytosine, G for guanine, and T for thymine
A DNA molecule is shaped like a double helix,
this thing that looks like a twisted ladder.
And the rungs of this ladder are made up of pairs of
bases
We can take the DNA molecule and
turn it into a sequence of letters, a string

DNA Sequencer Machine
A DNA sequencer is a scientific
instrument used to automate the
DNA sequencing process. Given a
sample of DNA, a DNA sequencer
is used to determine the order of
the four bases:
G (guanine), C (cytosine), A
(adenine) and T (thymine). This is
then reported as a text string,
called a read

DNA
Reads: randomly selected snippets from the middle of the input DNA (your genome),
many, many, many, many of them

A sequencing machine produces a lot of
redundant data
3 Gbp of human genome -> 100 Gbp of reads

1 byte = 4 base pairs (2 bits per base pair)
[A-T] = 01, [T-A] = 10, [C-G] = 00, [G-C] = 11
46 chromosomes, 23 from each parent
1.5 GB
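The "1 byte = 4 base pairs" packing can be sketched as follows. This is a minimal sketch, assuming the slide's codes are read as per-base codes (A=01, T=10, C=00, G=11); the names `CODE`, `pack` and `unpack` are hypothetical and not part of the original slides.

```python
# 2-bit codes read off the slide (assumption: the code belongs to the first base of each pair).
CODE = {"A": 0b01, "T": 0b10, "C": 0b00, "G": 0b11}
BASE = {v: k for k, v in CODE.items()}

def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, 4 bases per byte (last byte zero-padded)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n])

packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

The length has to be stored separately because the zero padding of the last byte is indistinguishable from a run of Cs.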

1.5 x 10^14 GB
All genome sequences in the human body
Whoa! That’s a big number, aren’t you proud?

Burrows-Wheeler transform
A first approach in space optimization
> Shevchenko Sergiy

Reversible permutation of the characters of a string, used originally for compression
T = a b a a b a $
Sorted rotations of T (the BWT matrix):
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
BWT(T) = last column = a b b a $ a a
How is it useful for compression?
How is it reversible?
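The construction on this slide (write all rotations, sort them, read the last column) can be sketched in a few lines. This is an illustrative sketch only: the function name `bwt_by_rotations` is hypothetical, and it assumes T ends with a unique `$` that sorts before every other character. It needs quadratic memory, so it is for small examples, not terabases.

```python
def bwt_by_rotations(t: str) -> str:
    """Return BWT(t) by sorting all rotations of t and reading the last column."""
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations)

print(bwt_by_rotations("abaaba$"))  # abba$aa
```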

Examples
BURROWS.WHEELER.AND.BURROWS.WHEELER$
RRDSS…$NHHEELLWWEEARREERRUUWWBB..OO
TODAY.OR.TOMORROW$
YRDOOTT.MROROW$.OA

BWT bears a resemblance to the suffix array
BWM(T)          SA(T)
$ a b a a b a   6
a $ a b a a b   5
a a b a $ a b   2
a b a $ a b a   3
a b a a b a $   0
b a $ a b a a   4
b a a b a $ a   1
Sort order is the same whether rows are rotations or suffixes

In fact, this gives us a new definition/way to construct BWT(T):
BWM(T)          Suffix          SA(T)
$ a b a a b a   $               6
a $ a b a a b   a $             5
a a b a $ a b   a a b a $       2
a b a $ a b a   a b a $         3
a b a a b a $   a b a a b a $   0
b a $ a b a a   b a $           4
b a a b a $ a   b a a b a $     1
“BWT = characters just to the left of the suffixes in the suffix array”
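The suffix-array definition above can be sketched as follows. The names `suffix_array` and `bwt_from_sa` are hypothetical, and the suffix array is built by naive sorting, which is only reasonable for toy inputs; real tools use dedicated construction algorithms.

```python
def suffix_array(t: str) -> list:
    """Offsets of the suffixes of t in sorted order (naive construction)."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def bwt_from_sa(t: str) -> str:
    """BWT(t): the character just to the left of each suffix in suffix-array order."""
    # For offset 0 the index -1 wraps around to the final '$'.
    return "".join(t[i - 1] for i in suffix_array(t))

print(suffix_array("abaaba$"))  # [6, 5, 2, 3, 0, 4, 1]
print(bwt_from_sa("abaaba$"))   # abba$aa
```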

How to reverse the BWT?
T = a b a a b a $, BWT(T) = a b b a $ a a
BWM has a key property called
LF mapping

Burrows-Wheeler transform: T-ranking
a0 b0 a1 a2 b1 a3 $
Give each character in T a rank, equal to the number of times the character occurred
previously in T. Call this the T-ranking.
Ranks aren’t explicitly stored; they are just for illustration
Now, let's rewrite BWM including ranks...

BWM with ranking
$ a0 b0 a1 a2 b1 a3
a3 $ a0 b0 a1 a2 b1
a1 a2 b1 a3 $ a0 b0
a2 b1 a3 $ a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $ a0 b0 a1 a2
b0 a1 a2 b1 a3 $ a0
F L
Look at the first and the last columns, called
F and L
Look at just the as: they occur in the same order in F and in L (a3, a1, a2, a0)
Same with the bs (b1, b0)
LF mapping: the i-th occurrence of a character c in L and the i-th
occurrence of c in F correspond to
the same occurrence in T (i.e. they have the same rank)
However we rank occurrences of c, ranks appear
in the same order in F and L
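The LF property can be checked mechanically on the running example. This is a small sketch with hypothetical helper names (`ranked`, `first_last_columns`): it tags each character of T with its T-rank, sorts the rotations by their characters only, and compares the order of the ranked characters in F and in L.

```python
def ranked(t: str) -> list:
    """Attach a T-rank to each character: (char, number of earlier occurrences in t)."""
    seen = {}
    out = []
    for c in t:
        out.append((c, seen.get(c, 0)))
        seen[c] = seen.get(c, 0) + 1
    return out

def first_last_columns(t: str):
    """Return the ranked F and L columns of the sorted rotation matrix of t."""
    r = ranked(t)
    rows = sorted((r[i:] + r[:i] for i in range(len(r))),
                  key=lambda row: [c for c, _ in row])  # ranks play no role in sorting
    return [row[0] for row in rows], [row[-1] for row in rows]

f, l = first_last_columns("abaaba$")
for c in "ab":
    # Same ranked occurrences, in the same order, in both columns.
    print(c, [x for x in f if x[0] == c] == [x for x in l if x[0] == c])
```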

BWM with T-ranking:
$ a0 b0 a1 a2 b1 a3
a3 $ a0 b0 a1 a2 b1
a1 a2 b1 a3 $ a0 b0
a2 b1 a3 $ a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $ a0 b0 a1 a2
b0 a1 a2 b1 a3 $ a0
F L

BWM with B-ranking:
$ a3 b1 a1 a2 b0 a0
a0 $ a3 b1 a1 a2 b0
a1 a2 b0 a0 $ a3 b1
a2 b0 a0 $ a3 b1 a1
a3 b1 a1 a2 b0 a0 $
b0 a0 $ a3 b1 a1 a2
b1 a1 a2 b0 a0 $ a3
F L
With B-ranking, ranks are in ascending order in F.
F now has a very simple structure:
a $, a block of as with ascending
ranks, then a block of bs with
ascending ranks

F    L
$    a0
a0   b0
a1   b1
a2   a1
a3   $
b0   a2
b1   a3
Which BWM row begins with b1?
Skip the row starting with $ (1 row)
Skip the rows starting with a (4 rows)
Skip the row starting with b0 (1 row)
Answer: row 6

Say T has 300 As, 400 Cs, 250 Gs and 700 Ts, and $ < A < C < G < T
Which BWM row (0-based) begins with G100? (Ranks are B-ranks)
✘ Skip the row starting with $ (1 row)
✘ Skip the rows starting with A (300 rows)
✘ Skip the rows starting with C (400 rows)
✘ Skip the first 100 rows starting with G (100 rows)
Answer: 1 + 300 + 400 + 100 = row 801
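The row arithmetic above only needs the character counts of T, which is why F can be stored as one integer per alphabet character. A minimal sketch, with the hypothetical helper `first_column_row` and relying on `$` sorting before the letters in ASCII:

```python
def first_column_row(counts: dict, c: str, rank: int) -> int:
    """0-based BWM row whose first character is occurrence `rank` (B-rank) of c.

    counts maps each character of T to its number of occurrences.
    """
    smaller = sum(n for ch, n in counts.items() if ch < c)  # rows skipped before the c block
    return smaller + rank

counts = {"$": 1, "A": 300, "C": 400, "G": 250, "T": 700}
print(first_column_row(counts, "G", 100))  # 801
```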

Reverse BWT(T) starting at the right-hand side of T and moving left
F    L
$    a0
a0   b0
a1   b1
a2   a1
a3   $
b0   a2
b1   a3
Start in the first row. F must have $. L contains the character just prior to $: a0
a0: LF mapping says this is the same occurrence of a as the first a in F.
Jump to the row beginning with a0.
L contains the character just prior to a0: b0
Repeat for b0, get a2
Repeat for a2, get a1
Repeat for a1, get b1
Repeat for b1, get a3
Repeat for a3, get $, done
Reverse of the characters we visited =
a3 b1 a1 a2 b0 a0 $ = T
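The walk above translates almost directly into code. This is a sketch under the same assumptions as the slides (T ends with a unique `$`); the name `reverse_bwt` and its internal variables are hypothetical.

```python
from collections import Counter

def reverse_bwt(bw: str) -> str:
    """Rebuild T from BWT(T) by repeated LF mapping, from right to left."""
    # B-rank of every character of L: how many times it occurred earlier in L.
    seen = Counter()
    ranks = []
    for c in bw:
        ranks.append(seen[c])
        seen[c] += 1
    # first[c] = row of F where the block of c starts.
    first, total = {}, 0
    for c in sorted(seen):
        first[c] = total
        total += seen[c]
    row = 0   # row 0 is the rotation that starts with '$'
    t = "$"
    while bw[row] != "$":
        c = bw[row]
        t = c + t                      # L holds the character just before the current one
        row = first[c] + ranks[row]    # LF step: jump to the row starting with that occurrence
    return t

print(reverse_bwt("abba$aa"))  # abaaba$
```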

1. We’ve seen how BWT is useful for compression:
a. Sorts characters by right-context, making a more compressible string
2. And how it’s reversible:
a. Repeated applications of LF Mapping, recreating T from right to left
How is it used to index?

FM-index
A more space-efficient alternative to suffix array
> Sarto Lino

FM-index
FM-index: an index combining the BWT with a few small auxiliary data structures
Core of index consists of F and L in BWM:
➔ F can be represented very simply (1 integer per alphabet character)
➔ L is compressible
➔ Potentially very space economical!
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
F L
The middle columns are not stored in the index

FM-index: querying
Though BWM is related to the suffix array, we can’t query it the same way
Suffix          SA(T)
$               6
a $             5
a a b a $       2
a b a $         3
a b a a b a $   0
b a $           4
b a a b a $     1
BWM(T)
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
We don’t have the middle columns of BWM, only F and L. Binary search is not possible

FM-index: querying
Look for the range of rows of BWM(T) with P as prefix
Do this for P’s shortest suffix, then extend to successively longer suffixes until
the range becomes empty or we have exhausted P
P = aba
Easy to find all rows beginning
with a, thanks to F’s simple
structure
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃

FM-index: querying
We have the rows beginning with a, now we seek rows beginning with ba
P = aba
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Look at those rows in L.
b₀, b₁ are the bs appearing just to the left.
Use LF mapping. Let the new
range delimit those bs.

FM-index: querying
We have the rows beginning with ba, now we seek rows beginning with aba
P = aba
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
a₂, a₃ occur just to the left.
Use LF mapping

FM-index: querying
Now we have the same range, [3, 5[, that
we would have got from querying the suffix array
P = aba
F L                Suffix          SA(T)
$ a b a a b a₀     $               6
a₀ $ a b a a b₀    a $             5
a₁ a b a $ a b₁    a a b a $       2
a₂ b a $ a b a₁    a b a $         3    <- [3, 5[
a₃ b a a b a $     a b a a b a $   0    <- [3, 5[
b₀ a $ a b a a₂    b a $           4
b₁ a a b a $ a₃    b a a b a $     1
Where are these?
Unlike the suffix array, we do not immediately
know where the matches are in T...
Note: when P does not occur in T,
the search fails to find the next character in L and the range becomes empty
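The three query steps shown for P = aba can be sketched as a single loop over P from right to left. This is a minimal sketch: `backward_search` is a hypothetical name, the range is half-open as on the slides, and the rank lookup is a plain scan of L rather than the faster structure an actual FM-index would use.

```python
def backward_search(bw: str, p: str):
    """Return the half-open range (lo, hi) of BWM rows that have p as a prefix.

    An empty range is reported as (0, 0).
    """
    # first[c] = row of F where the block of c starts.
    first, total = {}, 0
    for c in sorted(set(bw)):
        first[c] = total
        total += bw.count(c)

    def rank(c, row):
        # Occurrences of c in L[0:row], found here by scanning.
        return bw[:row].count(c)

    lo, hi = 0, len(bw)
    for c in reversed(p):          # shortest suffix of p first
        if c not in first:
            return 0, 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:               # next character not found in L within the range
            return 0, 0
    return lo, hi

print(backward_search("abba$aa", "aba"))  # (3, 5)
```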

FM-index: querying
If we scan characters in the last column, that can be very slow, O(m)
P = aba
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Scan, looking for bs

FM-index: issues
1. Scanning for preceding characters is slow
2. Storing ranks takes too much space
3. Need way to find where matches occur in T

FM-index: fast rank calculations
Is there an O(1) way to
determine which bs precede
the as in our range?
Idea: pre-calculate the number
of as, bs in L up to every row
        Tally
F  L    a  b
$  a    1  0
a  b    1  1
a  b    1  2
a  a    2  2
a  $    2  2
b  a    3  2
b  a    4  2
O(1) time, but requires
m x |Σ| integers

Another idea: pre-calculate # as, bs up to
some rows, e.g. every 5th row.
Call the pre-calculated rows checkpoints
        Tally
F  L    a  b
$  a    1  0
a  b
a  b
a  a
a  $
b  a    3  2
b  a
At a checkpoint row, the lookup succeeds as usual.
At any other row there is no tally (oops, not a checkpoint here),
but there is a checkpoint nearby.
To resolve a lookup for a character c in a
non-checkpoint row, scan along L until we
get to the nearest checkpoint. Use the tally at the
checkpoint, adjusted for the number of cs we saw
along the way

Another example
      Tally
L     a    b
...   ...  ...
b     234  222
b
a
b                <- What is my rank? 222 + 2 - 1 = 223
b
a
b
a                <- What is my rank? 238 - 1 - 1 = 236
a     238  226
...   ...  ...
Assuming checkpoints are
spaced O(1) distance apart,
lookups are O(1)
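A checkpointed tally can be sketched as below. This is a simplified sketch, not the slides' exact scheme: the class name `CheckpointedTally`, the parameter `step` and the method `count_through` are hypothetical, and the lookup always scans forward from the preceding checkpoint instead of choosing the nearest one in either direction.

```python
class CheckpointedTally:
    """Occurrence counts in L stored only every `step` rows (the checkpoints)."""

    def __init__(self, bw: str, step: int = 5):
        self.bw, self.step = bw, step
        self.checkpoints = []   # checkpoints[k][c] = occurrences of c in L[0 : k*step + 1]
        running = {c: 0 for c in set(bw)}
        for i, c in enumerate(bw):
            running[c] += 1
            if i % step == 0:
                self.checkpoints.append(dict(running))

    def count_through(self, c: str, row: int) -> int:
        """Number of occurrences of c in L[0 : row + 1] (row included)."""
        k = row // self.step                      # checkpoint at or before `row`
        count = self.checkpoints[k].get(c, 0)
        for i in range(k * self.step + 1, row + 1):
            if self.bw[i] == c:                   # adjust for the cs seen on the way
                count += 1
        return count

tally = CheckpointedTally("abba$aa", step=5)
print(tally.checkpoints)            # counts stored for rows 0 and 5 only
print(tally.count_through("a", 3))  # 2, same as the full tally table
```

The B-rank of the character in a given row is `count_through(c, row) - 1`, which matches the "- 1" in the rank computations of the example above.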

FM-index: a few problems
SOLVED! At the expense of adding checkpoints to index -> O(m) integers
● With checkpoints scan takes time O(1)
● With checkpoints we greatly reduce number of integers needed for ranks
○ But it is still O(m) space. There’s a literature to improve this space bound
NOT YET RESOLVED: need a way to find where these occurrences are in T

FM-index & Reads Compression
A more space-efficient search for data occurrences
& compression idea for reads
> D’Avino Ferdinando

Find Occurrences in T
We need a way to find where the occurrences are in T
A naive method: STORE THE SUFFIX ARRAY AND LOOK UP
F L                Suffix          SA(T)
$ a b a a b a₀     $               6
a₀ $ a b a a b₀    a $             5
a₁ a b a $ a b₁    a a b a $       2
a₂ b a $ a b a₁    a b a $         3
a₃ b a a b a $     a b a a b a $   0
b₀ a $ a b a a₂    b a $           4
b₁ a a b a $ a₃    b a a b a $     1
str : aba
Offset : 0, 3

Find Occurrences in T
Another idea: STORE SOME ENTRIES AND LOOK UP USING THE FM-INDEX
F L                Suffix          SA sample
$ a b a a b a₀     $               6
a₀ $ a b a a b₀                    x
a₁ a b a $ a b₁    a a b a $       2
a₂ b a $ a b a₁                    x
a₃ b a a b a $     a b a a b a $   0
b₀ a $ a b a a₂    b a $           4
b₁ a a b a $ a₃                    x
str : aba
aba INDEX: ? (no entry stored for this row)
aaba INDEX: 2 (reached with one LF step to the left; this entry is stored)
aba INDEX: 2 + 1 = 3
IMPORTANT: The SA entries to store are chosen so that they lie at a
constant distance from each other in T. In our example this distance is 2.
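The lookup with a sampled suffix array can be sketched as follows: walk left with LF steps until a stored entry is reached, then add the number of steps taken. The names `locate` and `sampled_sa` are hypothetical, and the rank lookup is again a plain scan for brevity.

```python
def locate(bw: str, sampled_sa: dict, row: int) -> int:
    """Offset in T of the suffix in BWM row `row`, using only a sampled suffix array.

    sampled_sa maps row -> offset for the rows whose entry was kept.
    """
    first, total = {}, 0
    for c in sorted(set(bw)):
        first[c] = total
        total += bw.count(c)
    steps = 0
    while row not in sampled_sa:
        c = bw[row]
        row = first[c] + bw[:row].count(c)   # LF step: one character to the left in T
        steps += 1
    return sampled_sa[row] + steps

# Entries kept for offsets 0, 2, 4, 6 (distance 2), as in the example above.
sampled = {0: 6, 2: 2, 4: 0, 5: 4}
print(sorted(locate("abba$aa", sampled, r) for r in range(3, 5)))  # [0, 3]
```

Because offset 0 is always sampled when entries are spaced by offset, the walk stops before it would have to step left from the start of T.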

FM Index: Small memory footprint
Components of the FM Index
First Column (F): ~ |Σ| integers
Last Column (L): m characters
SA sample: (m ・ a) integers, where a is the fraction of rows kept
Checkpoints: (m ・ |Σ| ・ b) integers, where b is the fraction of rows checkpointed

FM Index: Small memory footprint
DNA alphabet (2 bits per nucleotide), T = human genome, a = 1/32, b = 1/128
First Column (F): 4 bytes ・ 4 = 16 bytes
Last Column (L): 2 bits ・ 3 billion chars = 750 MB
SA sample: (3 billion chars ・ 4 bytes) / 32 = ~ 400 MB
Checkpoints: (3 billion chars ・ 4 bytes) / 128 = ~ 100 MB
Total < 1.5 GB
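The estimate can be reproduced with a few lines of arithmetic. This sketch follows the slide's figures literally; in particular the checkpoint line uses a single factor of 4 per checkpointed row, as the slide does, and whether that factor stands for 4 bytes per counter or for the 4 symbols is not stated, so treat it as an assumption. Variable names are hypothetical.

```python
m = 3_000_000_000          # characters in T (human genome, as on the slide)
a, b = 1 / 32, 1 / 128     # fraction of SA rows kept, fraction of rows checkpointed
MB = 10**6

f_bytes = 4 * 4            # one 4-byte integer per DNA symbol
l_bytes = m * 2 / 8        # 2 bits per character
sa_bytes = m * a * 4       # one 4-byte integer per kept SA entry
cp_bytes = m * b * 4       # slide's arithmetic: one factor of 4 per checkpointed row

total = f_bytes + l_bytes + sa_bytes + cp_bytes
print(l_bytes / MB, sa_bytes / MB, cp_bytes / MB, total / MB)
```

The exact values are 750 MB, 375 MB and 93.75 MB, which the slide rounds to 750, ~400 and ~100 MB.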

Compression of BWT Strings
Lots of possible compression schemes will benefit from preprocessing with BWT,
because it tends to group runs of same letters together
Ferragina & Manzini scheme

Move to First Transform
Also known as the move-to-front (MTF) transform
The main idea is to replace every symbol with its index in the stack
of “recently used symbols”
In a long sequence of identical symbols, every repeat after the first is replaced by a 0

Move to First Transform
Σ = {A C G T}, Δ = {0 1 2 3}, input: ...GCGACCT...
Symbol read   Output so far    Stack before the step
G             2                A C G T
C             2,2              G A C T
G             2,2,1            C G A T
A             2,2,1,2          G C A T
C             2,2,1,2,2        A G C T
C             2,2,1,2,2,0      C A G T
T             2,2,1,2,2,0,3    C A G T
FINAL: MTF(GCGACCT) = 2,2,1,2,2,0,3 (stack after the last step: T C A G)
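The table above can be reproduced with a short sketch. The function name `move_to_front` is hypothetical; the stack is a plain Python list, which is fine for a 4-letter alphabet.

```python
def move_to_front(s: str, alphabet: str) -> list:
    """Replace every symbol by its index in the stack of recently used symbols."""
    stack = list(alphabet)
    out = []
    for c in s:
        i = stack.index(c)             # position of c among recently used symbols
        out.append(i)
        stack.insert(0, stack.pop(i))  # move c to the front of the stack
    return out

print(move_to_front("GCGACCT", "ACGT"))  # [2, 2, 1, 2, 2, 0, 3]
```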

RLE Operation & Prefix Code
The RLE operation simply consists of replacing long sequences of 0s
with an integer representing the length of the sequence
The output at this point can be compressed further by an entropy coder,
such as Huffman coding (a prefix code) or arithmetic coding
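A run-length step over the MTF output could look like the sketch below. The slides do not say how a run length is told apart from an ordinary MTF index, so the `("run", length)` pair used here is purely an assumption for illustration, as is the name `rle_zeros`.

```python
def rle_zeros(codes: list) -> list:
    """Replace each run of 0s by a pair ("run", length); other codes pass through."""
    out, run = [], 0
    for x in codes:
        if x == 0:
            run += 1
            continue
        if run:
            out.append(("run", run))   # flush the run that just ended
            run = 0
        out.append(x)
    if run:
        out.append(("run", run))       # flush a run at the very end
    return out

print(rle_zeros([2, 0, 0, 0, 1, 0, 3]))  # [2, ('run', 3), 1, ('run', 1), 3]
```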

Possible future developments
Try to extend the RLE operation to replace all long sequences of the same char
Try other algorithms for the final phase

THANKS!
Any questions?
You can find us at
✘ f.davino10@studenti.unisa.it
✘ l.sarto1@studenti.unisa.it
✘ s.shevchenko@studenti.unisa.it
