Burrows-Wheeler transform for terabases
Based on work of Jouni Sirén
Wellcome Trust Sanger Institute
HELLO!
We are ...
D’Avino Ferdinando
Sarto Lino
Shevchenko Sergiy
Contents
➔ DNA, reads and strings
➔ Burrows Wheeler transform
➔ FM-index
➔ FM-index & Reads Compression
Introduction
DNA, reads and strings
> D’Avino Ferdinando
WHAT IS DNA?
DNA
A T
G C
Your genome is sort of like a book of recipes, with a separate recipe for each type of molecule in your body.
DNA is the kind of molecule that encodes your genome: the sum of all your genetic information, all your genes.
DNA
These letters stand for different kinds of molecules, the bases: A for adenine, C for cytosine, G for guanine, and T for thymine.
A DNA molecule is shaped like a double helix, which looks like a twisted ladder. The rungs of this ladder are made up of pairs of bases.
We can take the DNA molecule and turn it into a sequence of letters, a string.
DNA Sequencer Machine
A DNA sequencer is a scientific instrument used to automate the DNA sequencing process. Given a sample of DNA, a DNA sequencer determines the order of the four bases: G (guanine), C (cytosine), A (adenine) and T (thymine). The result is reported as a text string, called a read.
DNA
Reads: randomly selected snippets from the middle of the input DNA, many, many, many, many of them
[Figure: reads laid out along "Your Genome"]
A sequencing machine produces a lot of redundant data
3 Gbp of human genome
100 Gbp of reads
1 byte = 4 base pairs
[A-T] [T-A] [C-G] [G-C] = [01] [10] [00] [11]
46 chromosomes, 23 from each parent
1.5 GB
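The 2-bit packing above can be sketched as follows. This is only an illustration: it assumes the per-base codes read off the slide's base-pair table (A=01, T=10, C=00, G=11), and the names `pack` and `unpack` are hypothetical helpers, not part of the original deck.

```python
# 2-bit codes assumed from the slide's table: A=01, T=10, C=00, G=11.
CODE = {"A": 0b01, "T": 0b10, "C": 0b00, "G": 0b11}
BASE = {bits: base for base, bits in CODE.items()}


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte (first base in the high bits)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Pad a short final chunk with zero bits on the right.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n])
```

With four bases per byte, the roughly 6 billion bases of 46 chromosomes need about 6e9 / 4 = 1.5e9 bytes, which matches the 1.5 GB figure.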
1.5 × 10^14 GB
All genome sequences in the human body
Whoa! That’s a big number, aren’t you proud?
Burrows-Wheeler transform
A first approach in space optimization
> Shevchenko Sergiy
Burrows-Wheeler transform
Reversible permutation of the characters of a string, used originally for compression
T = a b a a b a $
Sort all rotations of T to get the BWT matrix, BWM(T):
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
BWT(T) = last column of BWM(T) = a b b a $ a a
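The construction above (sort the rotations, read off the last column) can be sketched directly. This is a naive illustration that materializes every rotation, so it is only suitable for short strings; the function names are illustrative, and it assumes `$` compares smaller than every other character, which holds for ASCII letters.

```python
def rotations(t: str) -> list[str]:
    """All cyclic rotations of t."""
    return [t[i:] + t[:i] for i in range(len(t))]


def bwm(t: str) -> list[str]:
    """Burrows-Wheeler matrix: the rotations of t in sorted order."""
    return sorted(rotations(t))


def bwt_via_bwm(t: str) -> str:
    """BWT(T): the last column of the Burrows-Wheeler matrix."""
    return "".join(row[-1] for row in bwm(t))
```

For example, `bwt_via_bwm("abaaba$")` gives `"abba$aa"`, the last column shown above.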
How is it useful for compression?
How is it reversible?
Burrows-Wheeler transform
Examples
T = BURROWS.WHEELER.AND.BURROWS.WHEELER$
BWT(T) = RRDSS..$NHHEELLWWEEARREERRUUWWBB..OO
T = TODAY.OR.TOMORROW$
BWT(T) = WYRDOOTT.MRORO$.OA
Burrows-Wheeler transform
BWT bears a resemblance to the suffix array
BWM(T)           SA(T)
$ a b a a b a    6   $
a $ a b a a b    5   a $
a a b a $ a b    2   a a b a $
a b a $ a b a    3   a b a $
a b a a b a $    0   a b a a b a $
b a $ a b a a    4   b a $
b a a b a $ a    1   b a a b a $
Sort order is the same whether rows are rotations or suffixes
Burrows-Wheeler transform
In fact, this gives us a new definition/way to construct BWT(T):
“BWT = characters just to the left of the suffixes in the suffix array”
SA(T)   Suffix           Character to the left
6       $                a
5       a $              b
2       a a b a $        b
3       a b a $          a
0       a b a a b a $    $
4       b a $            a
1       b a a b a $      a
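The suffix-array definition can be sketched as below. The suffix array is built by naively sorting suffixes, which is fine for an illustration but far too slow for a genome; the function names are illustrative.

```python
def suffix_array(t: str) -> list[int]:
    """Offsets of the suffixes of t, in sorted order of the suffixes."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt_via_sa(t: str) -> str:
    """BWT(T): the character just to the left of each suffix in suffix-array order.

    For offset 0, t[-1] wraps around to the final '$'.
    """
    return "".join(t[i - 1] for i in suffix_array(t))
```

For T = abaaba$, `suffix_array` returns `[6, 5, 2, 3, 0, 4, 1]` and `bwt_via_sa` returns `"abba$aa"`, the same string obtained from the sorted rotations.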
Burrows-Wheeler transform
How to reverse the BWT?
T = a b a a b a $
BWT(T) = a b b a $ a a
Going from T to BWT(T) means sorting the BWT matrix. How do we go back from BWT(T) to T?
BWM has a key property called LF mapping
Burrows-Wheeler transform: T-ranking
a0 b0 a1 a2 b1 a3 $
Give each character in T a rank, equal to the number of times the character occurred previously in T. Call this the T-ranking.
Ranks aren’t explicitly stored; they are just for illustration.
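A small sketch of the T-ranking, assuming each ranked character is represented as a `(character, rank)` pair; the function name `t_ranking` is illustrative.

```python
def t_ranking(t: str) -> list[tuple[str, int]]:
    """Tag each character of t with the number of times it occurred earlier in t."""
    seen: dict[str, int] = {}
    ranked = []
    for c in t:
        ranked.append((c, seen.get(c, 0)))
        seen[c] = seen.get(c, 0) + 1
    return ranked
```

For T = abaaba$ this yields a0 b0 a1 a2 b1 a3 and $ with rank 0.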
Now, let's rewrite BWM including ranks...
Burrows-Wheeler transform
BWM with ranking
$ a0 b0 a1 a2 b1 a3
a3 $ a0 b0 a1 a2 b1
a1 a2 b1 a3 $ a0 b0
a2 b1 a3 $ a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $ a0 b0 a1 a2
b0 a1 a2 b1 a3 $ a0
Look at the first and the last columns of the matrix above, called F and L:
F = $ a3 a1 a2 a0 b1 b0
L = a3 b1 b0 a1 $ a2 a0
Look at just the as: they occur in the same order (a3, a1, a2, a0) in F and in L.
Same with the bs (b1, b0).
LF mapping: the i-th occurrence of a character c in L and the i-th occurrence of c in F correspond to the same occurrence in T (i.e. have the same rank).
However we rank occurrences of c, ranks appear in the same order in F and L.
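The LF property can be checked on a small example. The sketch below tags every character with its T-rank, sorts the rotations by character only, and compares the order of ranks in the first and last columns; `ranked_bwm` and `column_order` are illustrative helper names.

```python
def ranked_bwm(t: str) -> list[list[tuple[str, int]]]:
    """Sorted rotations of t, with every character tagged by its T-rank."""
    seen: dict[str, int] = {}
    ranked = []
    for c in t:
        ranked.append((c, seen.get(c, 0)))
        seen[c] = seen.get(c, 0) + 1
    rots = [ranked[i:] + ranked[:i] for i in range(len(t))]
    # Sort on the characters only: ranks are labels and must not affect the order.
    return sorted(rots, key=lambda row: [c for c, _ in row])


def column_order(column: list[tuple[str, int]], c: str) -> list[int]:
    """Ranks of the occurrences of c, in top-to-bottom order within a column."""
    return [rank for ch, rank in column if ch == c]


rows = ranked_bwm("abaaba$")
F = [row[0] for row in rows]    # first column
L = [row[-1] for row in rows]   # last column
```

Here `column_order(F, "a")` and `column_order(L, "a")` are both `[3, 1, 2, 0]`, and the bs give `[1, 0]` in both columns.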
Burrows-Wheeler transform
BWM with T-ranking:
$ a0 b0 a1 a2 b1 a3
a3 $ a0 b0 a1 a2 b1
a1 a2 b1 a3 $ a0 b0
a2 b1 a3 $ a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $ a0 b0 a1 a2
b0 a1 a2 b1 a3 $ a0
F L
Burrows-Wheeler transform
BWM with B-ranking:
$ a3 b1 a1 a2 b0 a0
a0 $ a3 b1 a1 a2 b0
a1 a2 b0 a0 $ a3 b1
a2 b0 a0 $ a3 b1 a1
a3 b1 a1 a2 b0 a0 $
b0 a0 $ a3 b1 a1 a2
b1 a1 a2 b0 a0 $ a3
F L
Ranks now appear in ascending order, so F has a very simple structure: a $, a block of as with ascending ranks, and a block of bs with ascending ranks.
Burrows-Wheeler transform
F    L
$    a0
a0   b0
a1   b1
a2   a1
a3   $
b0   a2
b1   a3
Which BWM row begins with b1?
Skip row starting with $ (1 row)
Skip rows starting with a (4 rows)
Skip row starting with b0 (1 row)
Answer: row 6
Burrows-Wheeler transform
Say T has 300 As, 400 Cs, 250 Gs and 700 Ts and $ < A < C < G < T
Which BWM row (0-based) begins with G100? (Ranks are B-ranks)
✘ Skip row starting with $ (1 row)
✘ Skip rows starting with A (300 rows)
✘ Skip rows starting with C (400 rows)
✘ Skip first 100 rows starting with G (100 rows)
Answer: 1 + 300 + 400 + 100 = row 801
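The row arithmetic above can be sketched as a lookup that needs only the character counts, which is why F can be stored as one integer per alphabet character. The function name `first_column_row` and the dictionary layout are assumptions of this sketch.

```python
def first_column_row(counts: dict[str, int], c: str, rank: int) -> int:
    """0-based BWM row whose first character is c with B-rank `rank`.

    counts maps each character other than '$' to its number of occurrences in T.
    """
    if not 0 <= rank < counts[c]:
        raise ValueError("rank out of range for this character")
    skipped = 1  # the single row that starts with '$'
    for other in sorted(counts):
        if other < c:
            skipped += counts[other]  # whole block of a smaller character
    return skipped + rank
```

With the counts from the slide, `first_column_row({"A": 300, "C": 400, "G": 250, "T": 700}, "G", 100)` gives 801, and for T = abaaba$ the row beginning with b1 is `first_column_row({"a": 4, "b": 2}, "b", 1)`, which is 6.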
Burrows-Wheeler transform
F    L
$    a0
a0   b0
a1   b1
a2   a1
a3   $
b0   a2
b1   a3
Reverse BWT(T) starting at the right-hand side of T and moving left.
Start in the first row. F must have $. L contains the character just prior to $: a0
a0: LF mapping says this is the same occurrence of a as the first a in F. Jump to the row beginning with a0.
L contains the character just prior to a0: b0
Repeat for b0, get a2
Repeat for a2, get a1
Repeat for a1, get b1
Repeat for b1, get a3
Repeat for a3, get $, done
Reverse of characters we visited =
a3 b1 a1 a2 b0 a0 $ = T
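The walk above can be sketched as a short routine. It computes B-ranks for L, derives the start of each character block in F from the counts, and then follows the LF mapping from row 0 until it reaches `$`. The helper names are illustrative, and the sketch assumes `$` sorts before every other character.

```python
def rank_bwt(bw: str) -> tuple[list[int], dict[str, int]]:
    """B-rank of each character of bw, plus the total count of each character."""
    totals: dict[str, int] = {}
    ranks = []
    for c in bw:
        ranks.append(totals.get(c, 0))
        totals[c] = totals.get(c, 0) + 1
    return ranks, totals


def first_column(totals: dict[str, int]) -> dict[str, int]:
    """Row at which each character's block starts in F."""
    first = {}
    row = 0
    for c in sorted(totals):
        first[c] = row
        row += totals[c]
    return first


def reverse_bwt(bw: str) -> str:
    """Rebuild T from BWT(T) by repeated LF mapping, right to left."""
    ranks, totals = rank_bwt(bw)
    first = first_column(totals)
    row = 0          # row 0 starts with '$'
    t = "$"
    while bw[row] != "$":
        c = bw[row]
        t = c + t                    # prepend the character just to the left
        row = first[c] + ranks[row]  # LF: jump to the row starting with this occurrence
    return t
```

`reverse_bwt("abba$aa")` returns `"abaaba$"`, visiting the rows in the same order as the steps listed above.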
Burrows-Wheeler transform
1. We’ve seen how BWT is useful for compression:
a. Sorts characters by right-context, making a more compressible string
2. And how it’s reversible:
a. Repeated applications of LF Mapping, recreating T from right to left
How is it used to index?
FM-index
A more space-efficient alternative to suffix array
> Sarto Lino
FM-index
FM-index: an index combining the BWT with a few small auxiliary data structures
Core of index consists of F and L in BWM:
➔ F can be represented very simply (1 integer per alphabet character)
➔ L is compressible
➔ Potentially very space economical!
F           L
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
The columns between F and L are not stored in the index.
FM-index: querying
Though BWM is related to the suffix array, we can’t query it the same way
SA(T)   Suffix           BWM(T)
6       $                $ a b a a b a
5       a $              a $ a b a a b
2       a a b a $        a a b a $ a b
3       a b a $          a b a $ a b a
0       a b a a b a $    a b a a b a $
4       b a $            b a $ a b a a
1       b a a b a $      b a a b a $ a
We don’t have the middle columns of BWM(T). Binary search is not possible
FM-index: querying
Look for the range of rows of BWM(T) with P as prefix
Do this for P’s shortest suffix, then extend to successively longer suffixes until the range becomes empty or we have exhausted P
P = aba
F            L
$  a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Easy to find all rows beginning with a, thanks to F’s simple structure: rows [1, 5[
FM-index: querying
We have the rows beginning with a, now we seek rows beginning with ba
P = aba
Look at those rows in L. b₀, b₁ are the bs appearing just to the left.
Use LF mapping. Let the new range delimit those bs.
FM-index: querying
We have the rows beginning with ba, now we seek rows beginning with aba
P = aba
a₂, a₃ occur just to the left.
Use LF mapping
FM-index: querying
Now we have the same range, [3, 5[, that we would have got from querying the suffix array
P = aba
Where are these? Unlike the suffix array, we do not immediately know where the matches are in T...
Note: when P does not occur in T, the search fails to find the next character in L and the range becomes empty
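The querying steps above can be sketched as a backward search over L. This is a deliberately simple version: the helper `occ` counts occurrences by scanning L, so it is the slow variant discussed next, not an O(1) rank structure. The function name and the half-open `(lo, hi)` return convention are assumptions of this sketch.

```python
def backward_search(bw: str, p: str) -> tuple[int, int]:
    """Half-open range [lo, hi[ of BWM rows that have p as a prefix; (0, 0) if none."""
    # first[c]: row where the block of rows starting with c begins in F.
    counts: dict[str, int] = {}
    for c in bw:
        counts[c] = counts.get(c, 0) + 1
    first = {}
    row = 0
    for c in sorted(counts):
        first[c] = row
        row += counts[c]

    def occ(c: str, i: int) -> int:
        """Occurrences of c in bw[:i], found by scanning (slow)."""
        return bw[:i].count(c)

    lo, hi = 0, len(bw)
    for c in reversed(p):       # shortest suffix of p first, then extend leftwards
        if c not in first:
            return (0, 0)
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:            # next character not found in L: p does not occur
            return (0, 0)
    return (lo, hi)
```

For BWT(T) = abba$aa, `backward_search("abba$aa", "aba")` passes through the ranges [1, 5[, [5, 7[ and ends at (3, 5).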
FM-index: querying
If we scan characters in the last column, that can be very slow, O(m)
P = aba
Scan, looking for bs
FM-index: issues
1. Scanning for preceding characters is slow
2. Storing ranks takes too much space
3. Need way to find where matches occur in T
FM-index: fast rank calculations
Is there an O(1) way to determine which bs precede the as in our range?
Idea: pre-calculate the number of as, bs in L up to every row
          Tally
F    L    a   b
$    a    1   0
a    b    1   1
a    b    1   2
a    a    2   2
a    $    2   2
b    a    3   2
b    a    4   2
O(1) time, but requires m × |Σ| integers
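A minimal sketch of the full tally table, assuming it is stored as one list of running counts per alphabet character; the function name `build_tally` is illustrative.

```python
def build_tally(bw: str, alphabet: str) -> dict[str, list[int]]:
    """tally[c][row] = number of occurrences of c in bw[0..row], inclusive."""
    tally = {c: [] for c in alphabet}
    running = {c: 0 for c in alphabet}
    for ch in bw:
        if ch in running:
            running[ch] += 1
        for c in alphabet:
            tally[c].append(running[c])
    return tally
```

`build_tally("abba$aa", "ab")` reproduces the two tally columns of the table: a lookup is a single array access, at the cost of m × |Σ| stored integers.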
FM-index: fast rank calculations
Another idea: pre-calculate the number of as, bs up to some rows only, e.g. every 5th row.
Call the pre-calculated rows checkpoints
          Tally
F    L    a   b
$    a    1   0
a    b
a    b
a    a
a    $
b    a    3   2
b    a
A lookup at a checkpoint row succeeds as usual. Ooops, at other rows there is no checkpoint, but there’s one nearby.
To resolve a lookup for a character c in a non-checkpoint row, scan along L until we get to the nearest checkpoint. Use the tally at the checkpoint, adjusted for the number of cs we saw along the way
FM-index: fast rank calculations
Another example
        Tally
L       a     b
...     ...   ...
b       234   222
b
a
b
b
a
b
a
a       238   226
...     ...   ...
What is my rank? 222 + 2 - 1 = 223
What is my rank? 238 - 1 - 1 = 236
Assuming checkpoints are spaced O(1) distance apart, lookups are O(1)
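The checkpoint scheme can be sketched as follows. As a simplification, this version always walks left (up) to the previous checkpoint rather than to the nearest one in either direction; the class name, the method name and the default spacing are assumptions of this sketch.

```python
class CheckpointedTally:
    """Tally of L stored only at every `spacing`-th row (rows 0, spacing, 2*spacing, ...)."""

    def __init__(self, bw: str, spacing: int = 5):
        self.bw = bw
        self.spacing = spacing
        self.checkpoints = {c: [] for c in set(bw)}
        running = {c: 0 for c in self.checkpoints}
        for i, ch in enumerate(bw):
            running[ch] += 1
            if i % spacing == 0:
                for c, n in running.items():
                    self.checkpoints[c].append(n)

    def count_through(self, c: str, row: int) -> int:
        """Occurrences of c in bw[0..row], inclusive."""
        if c not in self.checkpoints:
            return 0
        seen = 0
        i = row
        # Walk left to the previous checkpoint, counting the cs passed on the way.
        while i % self.spacing != 0:
            if self.bw[i] == c:
                seen += 1
            i -= 1
        return self.checkpoints[c][i // self.spacing] + seen
```

Each lookup scans at most `spacing - 1` characters of L, so with a constant spacing the lookup cost stays constant while only a fraction of the tally rows are stored. The 0-based rank of the character at a row is `count_through(c, row) - 1`, which is where the "- 1" in the slide's arithmetic comes from.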
FM-index: a few problems
SOLVED! At the expense of adding checkpoints to index -> O(m) integers
● With checkpoints scan takes time O(1)
● With checkpoints we greatly reduce number of integers needed for ranks
○ But it is still O(m) space. There’s a literature to improve this space bound
NOT YET RESOLVED: need a way to find where these occurrences are in T
FM-index & Reads Compression
A more space-efficient search for data occurrences & a compression idea for reads
> D’Avino Ferdinando
Find Occurrences in T
Need a way to find where the occurrences are in T
A naive method: STORE THE SUFFIX ARRAY AND LOOK UP
str: aba
F            L     SA(T)   Suffix
$  a b a a b a₀    6       $
a₀ $ a b a a b₀    5       a $
a₁ a b a $ a b₁    2       a a b a $
a₂ b a $ a b a₁    3       a b a $
a₃ b a a b a $     0       a b a a b a $
b₀ a $ a b a a₂    4       b a $
b₁ a a b a $ a₃    1       b a a b a $
Offsets: 0, 3
Find Occurrences in T
Another idea: STORE SOME ENTRIES AND LOOK UP USING FM-INDEX
str: aba
F            L     SA sample   Suffix
$  a b a a b a₀    6           $
a₀ $ a b a a b₀    x
a₁ a b a $ a b₁    2           a a b a $
a₂ b a $ a b a₁    x
a₃ b a a b a $     0           a b a a b a $
b₀ a $ a b a a₂    4           b a $
b₁ a a b a $ a₃    x
aba INDEX: ? (no entry stored for this row)
aaba INDEX: 2 (one LF step to the left reaches a stored entry)
aba INDEX: 2 + 1 = 3
IMPORTANT: the entries of SA to store are selected in such a way as to have a constant distance from each other in T. In our example this distance is 2.
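The sampled lookup can be sketched as below. It keeps only the suffix-array entries whose text offset is a multiple of the spacing, and resolves any other row by taking LF steps until a stored entry is reached, adding one per step. The function names are illustrative, and `lf` counts by scanning L for brevity where a real index would use the tally checkpoints.

```python
def build_sampled_index(t: str, spacing: int = 2) -> tuple[str, dict[int, int]]:
    """Return BWT(t) and a dict {row: offset} holding only the sampled SA entries."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # naive suffix array
    bw = "".join(t[i - 1] for i in sa)
    sampled = {row: off for row, off in enumerate(sa) if off % spacing == 0}
    return bw, sampled


def lf(bw: str, row: int) -> int:
    """LF mapping: row whose first character is the same occurrence as bw[row]."""
    c = bw[row]
    smaller = sum(1 for x in bw if x < c)   # rows starting with a smaller character
    return smaller + bw[:row].count(c)      # plus the B-rank of this occurrence


def resolve(bw: str, sampled: dict[int, int], row: int) -> int:
    """Text offset of the suffix at `row`, using only the sampled entries."""
    steps = 0
    while row not in sampled:
        row = lf(bw, row)   # move one character to the left in T
        steps += 1
    return sampled[row] + steps
```

For T = abaaba$ with spacing 2, the stored entries are rows 0, 2, 4 and 5 (offsets 6, 2, 0 and 4). Resolving row 3 takes one LF step to row 2 and returns 2 + 1 = 3; row 4 is stored and returns 0. Because offset 0 is always sampled and every LF step moves one position left in T, the walk takes fewer than `spacing` steps.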
FM Index: Small memory footprint
Components of the FM Index
First Column (F): ~|Σ| integers
Last Column (L): m characters
SA sample: m · a integers, where a is the fraction of rows kept
Checkpoints: m · |Σ| · b integers, where b is the fraction of rows checkpointed
FM Index: Small memory footprint
DNA alphabet (2 bits per nucleotide), T = human genome, a = 1/32, b = 1/128
First Column (F): 4 bytes · 4 = 16 bytes
Last Column (L): 2 bits · 3 billion chars = 750 MB
SA sample: (3 billion chars · 4 bytes) / 32 = ~400 MB
Checkpoints: (3 billion chars · 4 bytes) / 128 = ~100 MB
Total < 1.5 GB
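The arithmetic above can be reproduced with a small calculator. The function name and parameters are assumptions of this sketch. The slide's checkpoint line counts one 4-byte integer per checkpointed row; the `per_char_tally` flag instead counts one integer per alphabet character at each checkpoint, as in the component list.

```python
def fm_index_bytes(m: int, sigma: int = 4, bits_per_char: int = 2, int_bytes: int = 4,
                   sa_every: int = 32, checkpoint_every: int = 128,
                   per_char_tally: bool = False) -> dict[str, int]:
    """Approximate size in bytes of each FM-index component for a text of m characters."""
    f_col = sigma * int_bytes                  # one integer per alphabet character
    l_col = m * bits_per_char // 8             # packed last column
    sa_sample = m * int_bytes // sa_every      # every 32nd suffix-array entry
    tallies = sigma if per_char_tally else 1
    checkpoints = m * int_bytes * tallies // checkpoint_every
    return {
        "F": f_col,
        "L": l_col,
        "SA sample": sa_sample,
        "checkpoints": checkpoints,
        "total": f_col + l_col + sa_sample + checkpoints,
    }


slide = fm_index_bytes(3_000_000_000)
per_char = fm_index_bytes(3_000_000_000, per_char_tally=True)
```

With the slide's figures the components come to 16 bytes, 750 MB, 375 MB and about 94 MB, roughly 1.22 GB in total. Counting one tally per character at each checkpoint raises the checkpoint cost to 375 MB and the total to about 1.5 GB.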
Compression of BWT Strings
Lots of possible compression schemes will benefit from preprocessing with BWT,
because it tends to group runs of same letters together
Ferragina & Manzini scheme
Move to First Transform
The main idea is to replace every symbol with its index in the stack of “recently used symbols”
A long run of identical symbols becomes a run of 0s (every symbol after the first in the run is encoded as 0)
Move to First Transform
Σ = {A C G T}, input: ...GCGACCT...
Δ = {0 1 2 3}
Input     Output so far    Stack before the step
GCGACCT   2                A C G T
GCGACCT   2,2              G A C T
GCGACCT   2,2,1            C G A T
GCGACCT   2,2,1,2          G C A T
GCGACCT   2,2,1,2,2        A G C T
GCGACCT   2,2,1,2,2,0      C A G T
GCGACCT   2,2,1,2,2,0,3    C A G T
FINAL: MTF(GCGACCT) = 2,2,1,2,2,0,3
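The transform traced above can be sketched in a few lines, together with its inverse. The function names are illustrative, and the stack is a plain Python list, which is adequate for a four-letter alphabet.

```python
def mtf_encode(text: str, alphabet: str) -> list[int]:
    """Replace each symbol by its index in the recently-used stack, then move it to the front."""
    stack = list(alphabet)
    out = []
    for ch in text:
        i = stack.index(ch)
        out.append(i)
        stack.insert(0, stack.pop(i))
    return out


def mtf_decode(codes: list[int], alphabet: str) -> str:
    """Invert mtf_encode by replaying the same stack updates."""
    stack = list(alphabet)
    out = []
    for i in codes:
        ch = stack.pop(i)
        out.append(ch)
        stack.insert(0, ch)
    return "".join(out)
```

`mtf_encode("GCGACCT", "ACGT")` gives `[2, 2, 1, 2, 2, 0, 3]`, matching the trace, and `mtf_decode` recovers the input from it.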
RLE Operation & Prefix Code
The RLE operation simply consists in replacing long sequences of 0s with an integer representing the length of the sequence
The output at this point can be compressed further by an entropy coder, such as Huffman coding (a prefix code) or arithmetic coding
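A minimal sketch of the zero-run step, assuming a simple token format in which a run of zeros becomes the pair `(0, run_length)` and every other symbol passes through unchanged. This format is an assumption made for illustration and is not the exact encoding of the Ferragina & Manzini scheme.

```python
def rle_zeros(codes: list[int]) -> list:
    """Collapse each run of 0s into a (0, run_length) token."""
    out = []
    run = 0
    for x in codes:
        if x == 0:
            run += 1
        else:
            if run:
                out.append((0, run))
                run = 0
            out.append(x)
    if run:                      # flush a run that reaches the end of the input
        out.append((0, run))
    return out


def unrle_zeros(tokens: list) -> list[int]:
    """Expand (0, run_length) tokens back into runs of 0s."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            out.extend([0] * tok[1])
        else:
            out.append(tok)
    return out
```

For example, `rle_zeros([2, 0, 0, 0, 1, 0])` gives `[2, (0, 3), 1, (0, 1)]`, and `unrle_zeros` restores the original list.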
Possible future developments
Try to extend the RLE operation to replace all long sequences of the same character
Try other algorithms for the final phase
THANKS!
Any questions?
You can find us at
✘ f.davino10@studenti.unisa.it
✘ l.sarto1@studenti.unisa.it
✘ s.shevchenko@studenti.unisa.it
Burrows-Wheeler transform for terabases

  • 1. Burrows-Wheeler transform for terabasesBased on work of Jouni Sirén Wellcome Trust Sanger Institute
  • 2. HELLO! We are ... 2 D’Avino Ferdinando Sarto Lino Shevchenko Sergiy
  • 3. Contents ➔ DNA, reads and strings ➔ Burrows Wheeler transform ➔ FM-index ➔ FM-index & Reads Compression 3
  • 4. Introduction DNA, reads and strings > D’Avino Ferdinando
  • 6. DNA A T G C Your genome is sort like a book of recipes, with a separate recipe for each type of molecule in your body DNA is the kind of molecule that encode your genome, the sum of all your genetic information, all your genes
  • 7. DNA These letters stand for different kinds of molecules, different bases. A for adenine, C for cytosine, G for guanine, and T for thymine A DNA molecule is shaped like a double helix, this thing that looks like a twisted ladder. And the rungs of this ladder are made up of pairs of bases We can take the DNA molecule and turn it into a sequence of letters, a string
  • 8. DNA
  • 9. DNA
  • 10. DNA Sequencer Machine 10 A DNA sequencer is a scientific instrument used to automate the DNA sequencing process. Given a sample of DNA, a DNA sequencer is used to determine the order of the four bases: G (guanine), C (cytosine), A (adenine) and T (thymine). This is then reported as a text string, called a read
  • 11. DNA Randomly selected snippets out from the middle of the input DNA, many, many, many, many of them Reads Your Genome
  • 17. 3 Gbp 100 Gbp A sequence machine produces a lot of redundant data 17 Of human genome Of reads
  • 18. 1 Byte = 4 Base pairs [A-T] [T-A] [C-G] [G-C] = [01] [10] [00] [11] 1.5GB 46 Chromosomes 23 from each parent 18
  • 19. 1.5 10 GB Whoa! That’s a big number, aren’t you proud? 19 14 All genome sequences in human body x
  • 20. Burrows-Wheeler transform A first approach in space optimization > Shevchenko Sergiy
  • 21. Burrows-Wheeler transform a b a a b a $ Reversible permutation of characters of a string, used originally for compression $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a Sort BWT matrix a b b a $ a a T BWT(T) How is it useful for compression? How is it reversible?
  • 23. Burrows-Wheeler transform 23 BWT bears a resemblance to suffix array BWM(T) SA(T) $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a Sort order is the same whether rows are rotations or suffixes 6 5 2 3 0 4 1
  • 24. Burrows-Wheeler transform 24 In fact, this gives us a new definition/way to construct BWT(T): BWM(T) SA(T) $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a $ a $ a a b a $ a b a $ a b a a b a $ b a $ b a a b a $ “BWT = characters just to the left of the suffixes in the suffix array” 6 5 2 3 0 4 1
  • 25. Burrows-Wheeler transform 25 a b a a b a $ How to reverse the BWT? $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a Sort BWT matrix a b b a $ a a T BWT(T) ? BWM has a key property called LF mapping
  • 26. Burrows-Wheeler transform: T-ranking 26 aO bO a1 a2 b1 a3 $ Give each character in T a rank , equal to # times the character occurred previously in T. Call this the T-ranking. Ranks aren’t explicitly stored; they are just for illustration Now, let's rewrite BWM including ranks...
  • 27. Burrows-Wheeler transform 27 BWM with ranking $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0
  • 28. Burrows-Wheeler transform 28 BWM with ranking $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L Look at the first and the last columns, called F and L
  • 29. Burrows-Wheeler transform 29 BWM with ranking $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L Look at the first and the last columns, called F and L And look at just as as occurs in the same order in F and in L
  • 30. Burrows-Wheeler transform 30 BWM with ranking $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L Same with bs
  • 31. Burrows-Wheeler transform 31 BWM with ranking: $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L LF mapping: The ith occurrence of the character c in L and the ith occurrence of c in F correspond to the same occurence in T (i.e. have same rank) However we rank occurrences of c, ranks appears in the same order in F and L
  • 32. Burrows-Wheeler transform 32 BWM with T-ranking: $ a0 b0 a1 a2 b1 a3 a0 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L
  • 33. Burrows-Wheeler transform 33 BWM with B-ranking: $ a3 b1 a1 a2 b0 a0 a0 $ a3 b1 a1 a2 b0 a1 a2 b0 a3 $ a3 b1 a2 b0 a0 $ a3 b1 a1 a3 b1 a1 a2 b0 a0 $ b0 a0 $ a3 b1 a1 a2 b1 a1 a2 b0 a0 $ a3 F L Ascending orderF now has a very simple structure: a $, a block of as with ascending ranks, as block of bs with ascending ranks
  • 34. Burrows-Wheeler transform 34 a0 b0 b1 a1 $ a2 a3 F L $ a0 a1 a2 a3 b0 b1 Which BWM row begins with b1? Skip row starting with $(1 row) Skip rows starting with a(4 rows) Skip row starting with b0 (1 row) Answer: row 6row 6
  • 35. Burrows-Wheeler transform 35 Say T has 300 As, 400 Cs, 250 Gs and 700 Ts and $ < A < C < G < T Which BWM row (0-based) begins with G100? (Ranks are B-ranks) ✘ Skip row starting with $ ✘ Skip 300 A ✘ Skip 400 C ✘ Skip first 100 rows starting with G (100 rows) Answer: 1 + 300 + 400 + 100 = row 801
  • 36. Burrows-Wheeler transform 36 a0 b0 b1 a1 $ a2 a3 F L $ a0 a1 a2 a3 b0 b1 Reverse BWT(T) starting at right-hand-side of T and moving left Start in first row. F must have $. L contains character just prior to $: a0 a0 : LF mapping says this is the same occurence of a in as first a in F. Jump to row beginning with a0. L contains character just prior to a0: b0 Repeat for b0, get a2 Repeat for a2, get a1 Repeat for a1, get b1 Repeat for b1, get a3 Repeat for a3, get $, done Reverse of characters we visited = a3 b1 a1 a2 b0 a0 $ = T
  • 37. Burrows-Wheeler transform 37 1. We’ve seen how BWT is useful for compression: a. Sorts characters by right-context, making a more compressible string 2. And how it’s reversible: a. Repeated applications of LF Mapping, recreating T from right to left How is it used to index?
  • 38. FM-index A more space-efficient alternative to suffix array > Sarto Lino
  • 39. FM-index. An index combining the BWT with a few small auxiliary data structures. The core of the index consists of F and L from the BWM: ➔ F can be represented very simply (1 integer per alphabet character) ➔ L is compressible ➔ Potentially very space-economical! BWM rows (first column F, last column L): $ a b a a b a / a $ a b a a b / a a b a $ a b / a b a $ a b a / a b a a b a $ / b a $ a b a a / b a a b a $ a. The columns between F and L are not stored in the index.
  • 40. FM-index: querying. Though the BWM is related to the suffix array, we can’t query it the same way. Suffix array of T = abaaba$ (offset: suffix): 6: $ / 5: a $ / 2: a a b a $ / 3: a b a $ / 0: a b a a b a $ / 4: b a $ / 1: b a a b a $. BWM rows (first column F, last column L): $ a b a a b a / a $ a b a a b / a a b a $ a b / a b a $ a b a / a b a a b a $ / b a $ a b a a / b a a b a $ a. The index keeps only F and L; we don’t have the columns in between, so binary search over the rows is not possible.
  • 41. FM-index: querying. Look for the range of rows of BWM(T) that have P as a prefix. Do this for P’s shortest suffix, then extend to successively longer suffixes until the range becomes empty or we have exhausted P. Example: P = aba. It is easy to find all rows beginning with a, thanks to F’s simple structure. BWM rows (first column F, last column L): $ a b a a b a₀ / a₀ $ a b a a b₀ / a₁ a b a $ a b₁ / a₂ b a $ a b a₁ / a₃ b a a b a $ / b₀ a $ a b a a₂ / b₁ a a b a $ a₃.
  • 42. FM-index: querying. We have the rows beginning with a; now we seek the rows beginning with ba. P = aba. BWM rows (first column F, last column L): $ a b a a b a₀ / a₀ $ a b a a b₀ / a₁ a b a $ a b₁ / a₂ b a $ a b a₁ / a₃ b a a b a $ / b₀ a $ a b a a₂ / b₁ a a b a $ a₃. Look at those rows in L: b₀ and b₁ are the bs appearing just to the left. Use the LF mapping and let the new range delimit those bs in F.
  • 43. FM-index: querying. We have the rows beginning with ba; now we seek the rows beginning with aba. P = aba. BWM rows (first column F, last column L): $ a b a a b a₀ / a₀ $ a b a a b₀ / a₁ a b a $ a b₁ / a₂ b a $ a b a₁ / a₃ b a a b a $ / b₀ a $ a b a a₂ / b₁ a a b a $ a₃. In L, a₂ and a₃ occur just to the left. Use the LF mapping again to get the rows beginning with a₂ and a₃ in F.
  • 44. FM-index: querying. Now we have the same range, [3, 5[, that we would have got from querying the suffix array. P = aba. BWM rows (first column F, last column L): $ a b a a b a₀ / a₀ $ a b a a b₀ / a₁ a b a $ a b₁ / a₂ b a $ a b a₁ / a₃ b a a b a $ / b₀ a $ a b a a₂ / b₁ a a b a $ a₃. In the suffix array (6 5 2 3 0 4 1) the range [3, 5[ holds offsets 3 and 0, but unlike the suffix array, the FM-index does not immediately tell us where the matches are in T. Note: when P does not occur in T, at some step the next character of P is not found in L within the current range, and the range becomes empty.
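The three querying steps can be condensed into a backward-search loop. The sketch below is an assumption-laden illustration, not the indexed version: ranks are computed by naively counting in L (the slow scan that checkpoints are meant to replace), and the names `first_column_starts` and `backward_search` are hypothetical.

```python
def first_column_starts(last):
    """Map each character to the first row of its block in F."""
    starts = {}
    total = 0
    for c in sorted(set(last)):
        starts[c] = total
        total += last.count(c)
    return starts


def backward_search(last, pattern):
    """Return the half-open row range [lo, hi) of BWM rows prefixed by pattern."""
    starts = first_column_starts(last)
    lo, hi = 0, len(last)
    for c in reversed(pattern):          # shortest suffix of P first
        if c not in starts:
            return 0, 0                  # character never occurs in T
        # LF mapping applied to both ends of the current range.
        lo = starts[c] + last[:lo].count(c)
        hi = starts[c] + last[:hi].count(c)
        if lo >= hi:
            return lo, lo                # range became empty: no match
    return lo, hi


print(backward_search("abba$aa", "aba"))  # (3, 5)
print(backward_search("abba$aa", "bb"))   # (7, 7), i.e. empty
```

For P = aba the intermediate ranges are [1, 5), [5, 7) and [3, 5), the same ones traced on the slides.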
  • 45. FM-index: querying. Scanning the characters in the last column to find the preceding characters (here, scanning for bs) can be very slow: O(m). P = aba. BWM rows (first column F, last column L): $ a b a a b a₀ / a₀ $ a b a a b₀ / a₁ a b a $ a b₁ / a₂ b a $ a b a₁ / a₃ b a a b a $ / b₀ a $ a b a a₂ / b₁ a a b a $ a₃.
  • 46. FM-index: issues. 1. Scanning for preceding characters is slow. 2. Storing ranks takes too much space. 3. We need a way to find where matches occur in T.
  • 47. FM-index: fast rank calculations. BWM rows (first column F, last column L): $ a b a a b a₀ / a₀ $ a b a a b₀ / a₁ a b a $ a b₁ / a₂ b a $ a b a₁ / a₃ b a a b a $ / b₀ a $ a b a a₂ / b₁ a a b a $ a₃. Is there an O(1) way to determine which bs precede the as in our range? Idea: pre-calculate the number of as and bs in L up to every row. Tally, one row per BWM row (F, L, #a, #b): $ a 1 0 / a b 1 1 / a b 1 2 / a a 2 2 / a $ 2 2 / b a 3 2 / b a 4 2. Lookups take O(1) time, but the tally requires m × |Σ| integers.
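The full tally can be reproduced in a few lines. A minimal sketch; the function name `tally_table` and the choice to pass the tracked alphabet explicitly are assumptions made for illustration.

```python
def tally_table(last, alphabet):
    """For every row of L, count occurrences of each tracked character so far.

    Row i of the result holds the counts in last[0..i] inclusive,
    in the order given by alphabet.
    """
    running = {c: 0 for c in alphabet}
    rows = []
    for ch in last:
        if ch in running:      # $ is not tracked
            running[ch] += 1
        rows.append(tuple(running[c] for c in alphabet))
    return rows


for row in tally_table("abba$aa", "ab"):
    print(row)
```

The table has one entry per row and per character, which is the m × |Σ| integer cost mentioned above.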
  • 48. FM-index: fast rank calculations. Another idea: pre-calculate the number of as and bs only up to some rows, e.g. every 5th row. Call the pre-calculated rows checkpoints. Tally with checkpoints (F, L, #a, #b): $ a 1 0 (checkpoint) / a b / a b / a a / a $ / b a 3 2 (checkpoint) / b a. A lookup at a checkpoint row succeeds as usual. At a row that is not a checkpoint there is always one nearby: to resolve a lookup for a character c in a non-checkpoint row, scan along L until we get to the nearest checkpoint, then use the tally at the checkpoint, adjusted for the number of cs we saw along the way.
  • 49. FM-index: fast rank calculations. Another example. Tally (L, #a, #b): ... / b 234 222 (checkpoint) / b / a / b / b / a / b / a / a 238 226 (checkpoint) / ... What is my rank? For the b two bs below the first checkpoint: 222 + 2 - 1 = 223. What is my rank? For the a one a above the second checkpoint: 238 - 1 - 1 = 236. Assuming checkpoints are spaced O(1) distance apart, lookups are O(1).
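A checkpointed rank structure can be sketched as follows. This is an illustration under stated assumptions, not the layout of any particular FM-index implementation: the class name `CheckpointedRank` is hypothetical, checkpoints store counts strictly before their row, and the lookup always scans forward from the preceding checkpoint rather than choosing the nearest one in either direction.

```python
class CheckpointedRank:
    """Rank queries on L using tallies stored only every `spacing` rows."""

    def __init__(self, last, spacing=5):
        self.last = last
        self.spacing = spacing
        self.checkpoints = []   # checkpoints[k] = counts in last[:k * spacing]
        running = {}
        for i in range(len(last) + 1):
            if i % spacing == 0:
                self.checkpoints.append(dict(running))
            if i < len(last):
                running[last[i]] = running.get(last[i], 0) + 1

    def rank(self, c, i):
        """Number of occurrences of c in last[:i]."""
        k = i // self.spacing
        start = k * self.spacing
        # Checkpoint tally plus the cs seen between the checkpoint and row i.
        return self.checkpoints[k].get(c, 0) + self.last[start:i].count(c)


index = CheckpointedRank("abba$aa", spacing=3)
print(index.rank("a", 5), index.rank("b", 5))  # 2 2
```

Each query scans at most `spacing - 1` characters, so with a constant spacing the lookup cost stays constant while the stored tallies shrink by that factor.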
  • 50. FM-index: a few problems. SOLVED, at the expense of adding checkpoints to the index (O(m) integers): ● With checkpoints, each scan takes O(1) time ● With checkpoints, we greatly reduce the number of integers needed for ranks ○ But it is still O(m) space; there is a literature on improving this space bound. NOT YET RESOLVED: we need a way to find where these occurrences are in T.
  • 51. FM-index & Reads Compression A more space-efficient search for data occurrences & a compression idea for reads > D’Avino Ferdinando
  • 52. Find Occurrences in T. We need a way to find where the occurrences are in T. A naive method: STORE THE SUFFIX ARRAY AND LOOK UP. BWM rows (first column F, last column L): $ a b a a b a₀ / a₀ $ a b a a b₀ / a₁ a b a $ a b₁ / a₂ b a $ a b a₁ / a₃ b a a b a $ / b₀ a $ a b a a₂ / b₁ a a b a $ a₃. Suffix array (offset: suffix): 6: $ / 5: a $ / 2: a a b a $ / 3: a b a $ / 0: a b a a b a $ / 4: b a $ / 1: b a a b a $. str: aba; offsets: 0, 3.
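  • 53. Find Occurrences in T. Another idea: STORE SOME SA ENTRIES AND LOOK UP THE REST USING THE FM-INDEX. BWM rows (first column F, last column L): $ a b a a b a₀ / a₀ $ a b a a b₀ / a₁ a b a $ a b₁ / a₂ b a $ a b a₁ / a₃ b a a b a $ / b₀ a $ a b a a₂ / b₁ a a b a $ a₃. Sampled suffix array (x = not stored): 6 x 2 x 0 4 x, i.e. only the suffixes $, a a b a $, a b a a b a $ and b a $ keep their offsets. str: aba. The row for aba$ has no stored entry (INDEX: ?). One LF step leads to the row for aaba$, which is stored (INDEX: 2), so aba$ starts at 2 + 1 = 3 (INDEX: 3). IMPORTANT: the SA entries to store are selected so that their offsets in T are a constant distance apart, which bounds the number of LF steps needed to reach a stored entry. In our example this distance is 2.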
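Resolving an unsampled row by walking left with LF steps can be sketched as below. The code is a minimal illustration: the suffix array is built naively, ranks are counted by scanning L, and the names `build_sampled_index` and `resolve_offset` are hypothetical. It assumes T ends with a unique `$` and that entries are kept when their offset in T is a multiple of `distance` (so offset 0 is always stored and the walk terminates).

```python
def build_sampled_index(t, distance):
    """Return (L, block starts of F, {row: offset} for sampled SA entries)."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # naive suffix array
    last = "".join(t[i - 1] for i in sa)              # t[-1] is $ for offset 0
    samples = {row: off for row, off in enumerate(sa) if off % distance == 0}
    starts = {}
    total = 0
    for c in sorted(set(last)):
        starts[c] = total
        total += last.count(c)
    return last, starts, samples


def resolve_offset(row, last, starts, samples):
    """Offset in T of the suffix at this row, using LF steps to reach a sample."""
    steps = 0
    while row not in samples:
        c = last[row]
        row = starts[c] + last[:row].count(c)   # LF step: one position left in T
        steps += 1
    return samples[row] + steps


last, starts, samples = build_sampled_index("abaaba$", distance=2)
print(sorted(resolve_offset(r, last, starts, samples) for r in range(3, 5)))  # [0, 3]
```

Rows 3 and 4 are the range found for P = aba; row 4 is sampled directly, while row 3 needs one LF step to reach the stored offset 2.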
  • 54. FM Index: Small memory footprint. Components of the FM Index: First column (F): ~ |Σ| integers. Last column (L): m characters. SA sample: (m ・ a) integers, where a is the fraction of rows kept. Checkpoints: (m ・ |Σ| ・ b) integers, where b is the fraction of rows checkpointed.
  • 55. FM Index: Small memory footprint. DNA alphabet (2 bits per nucleotide), T = human genome, a = 1/32, b = 1/128. First column (F): 4 bytes ・ 4 = 16 bytes. Last column (L): 2 bits ・ 3 billion chars = 750 MB. SA sample: (3 billion chars ・ 4 bytes) / 32 = 375 MB (~ 400 MB). Checkpoints: (3 billion chars ・ 4 bytes) / 128 = ~ 94 MB (~ 100 MB). Total < 1.5 GB.
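The arithmetic on this slide can be reproduced directly. This is a back-of-the-envelope sketch: the function name and parameter names are hypothetical, integers are assumed to be 4 bytes, and `tallies_per_checkpoint=1` is an assumption chosen to match the slide's checkpoint figure (one 4-byte integer per checkpointed row). Setting it to 4, one tally per DNA character as in the component formula, gives 375 MB for the checkpoints instead.

```python
def fm_index_footprint_bytes(m, sigma=4, bits_per_char=2, int_bytes=4,
                             a=1 / 32, b=1 / 128, tallies_per_checkpoint=1):
    """Approximate size in bytes of each FM-index component for a text of m chars."""
    return {
        "F": sigma * int_bytes,                 # one integer per character
        "L": m * bits_per_char / 8,             # packed BWT
        "SA sample": m * a * int_bytes,         # fraction a of the suffix array
        "checkpoints": m * b * tallies_per_checkpoint * int_bytes,
    }


parts = fm_index_footprint_bytes(3_000_000_000)
for name, size in parts.items():
    print(f"{name}: {size / 1e6:.2f} MB")
print(f"total: {sum(parts.values()) / 1e9:.2f} GB")
```

Either way the index stays on the order of 1.5 GB, compared with 12 GB for a full suffix array of 3 billion 4-byte entries.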
  • 56. Compression of BWT Strings. Many compression schemes benefit from preprocessing with the BWT, because it tends to group identical letters into runs. Example: the Ferragina & Manzini scheme.
  • 57. Move to First Transform (also known as move-to-front, MTF). The main idea is to replace every symbol with its index in a stack of “recently used symbols”. Within a run of identical symbols, every symbol after the first is replaced by 0, so long runs become long runs of 0s.
  • 58. Move to First Transform. Σ = {A C G T}, input ...GCGACCT..., Δ = {0 1 2 3}. Steps (symbol read, stack before reading it, output so far): G, A C G T, 2 / C, G A C T, 2,2 / G, C G A T, 2,2,1 / A, G C A T, 2,2,1,2 / C, A G C T, 2,2,1,2,2 / C, C A G T, 2,2,1,2,2,0 / T, C A G T, 2,2,1,2,2,0,3. FINAL: MTF(GCGACCT) = 2,2,1,2,2,0,3.
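The trace above follows directly from a list that moves each symbol to the front after emitting its index. A minimal sketch; the function name `mtf_encode` is hypothetical and the initial stack order is taken from Σ as written on the slide.

```python
def mtf_encode(text, alphabet):
    """Move-to-front: emit each symbol's index in the stack, then move it to the front."""
    stack = list(alphabet)
    output = []
    for symbol in text:
        index = stack.index(symbol)
        output.append(index)
        stack.insert(0, stack.pop(index))   # most recently used symbol goes first
    return output


print(mtf_encode("GCGACCT", "ACGT"))  # [2, 2, 1, 2, 2, 0, 3]
```

A repeated symbol is already at the front of the stack, which is why the second C in `CC` is encoded as 0.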
  • 59. RLE Operation & Prefix Code. The RLE operation simply consists of replacing long sequences of 0s with an integer representing the length of the sequence. The output at this point can be compressed further by an entropy coder, such as Huffman coding (a prefix code) or arithmetic coding.
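One possible way to run-length encode the 0s in MTF output is sketched below. The token format is an assumption made for illustration, since the slide does not specify one: each run of 0s becomes a `(0, run_length)` tuple and all other values pass through unchanged. The function name `rle_zeros` is hypothetical, and the sketch encodes every run of 0s, not only long ones.

```python
from itertools import groupby


def rle_zeros(values):
    """Replace each run of 0s with a (0, run_length) token; keep other values as-is."""
    out = []
    for value, group in groupby(values):
        run = len(list(group))
        if value == 0:
            out.append((0, run))        # one token for the whole run of zeros
        else:
            out.extend([value] * run)   # non-zero values are not run-length encoded
    return out


print(rle_zeros([2, 2, 1, 0, 0, 0, 3, 0]))  # [2, 2, 1, (0, 3), 3, (0, 1)]
```

The resulting token stream is what an entropy coder would then compress.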
  • 61. Possible future developments. Try to extend the RLE operation to replace all long sequences of the same character. Try other algorithms for the final phase.
  • 62. THANKS! Any questions? You can find us at ✘ f.davino10@studenti.unisa.it ✘ l.sarto1@studenti.unisa.it ✘ s.shevchenko@studenti.unisa.it