Burrows-Wheeler transform for terabases
Based on work of Jouni Sirén
Wellcome Trust Sanger Institute
HELLO!
We are ...
D’Avino Ferdinando
Sarto Lino
Shevchenko Sergiy
Contents
➔ DNA, reads and strings
➔ Burrows Wheeler transform
➔ FM-index
➔ FM-index & Reads Compression
Introduction
DNA, reads and strings
> D’Avino Ferdinando
WHAT IS DNA?
DNA
A T
G C
Your genome is sort of like a book of recipes, with a separate recipe for each type of molecule in your body
DNA is the kind of molecule that encodes your genome, the sum of all your genetic information, all your genes
DNA
These letters stand for different kinds of molecules, different bases.
A for adenine, C for cytosine, G for guanine, and T for thymine
A DNA molecule is shaped like a double helix,
this thing that looks like a twisted ladder.
And the rungs of this ladder are made up of pairs of
bases
We can take the DNA molecule and
turn it into a sequence of letters, a string
DNA Sequencer Machine
A DNA sequencer is a scientific
instrument used to automate the
DNA sequencing process. Given a
sample of DNA, a DNA sequencer
is used to determine the order of
the four bases:
G (guanine), C (cytosine), A
(adenine) and T (thymine). This is
then reported as a text string,
called a read
DNA
Reads: randomly selected snippets from the middle of the input DNA, many, many, many, many of them
[Figure: Reads, Your Genome]
A sequencing machine produces a lot of redundant data:
3 Gbp of human genome
100 Gbp of reads
1 Byte = 4 base pairs
[A-T] = 01, [T-A] = 10, [C-G] = 00, [G-C] = 11
1.5 GB
46 chromosomes, 23 from each parent
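The 2-bits-per-base-pair idea can be sketched in a few lines of Python. This is only an illustration: the function names `pack_bases` and `genome_size_gb` are hypothetical, the bit codes are read off the pairing table above (A = 01, T = 10, C = 00, G = 11), and the 6 Gbp figure used in the usage note is our reading of "46 chromosomes" as twice the 3 Gbp quoted earlier, not something the slide states.

```python
# 2-bit codes read off the slide's pairing table (an assumption about its intent).
CODE = {"C": 0b00, "A": 0b01, "T": 0b10, "G": 0b11}


def pack_bases(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte.

    A final chunk shorter than four bases is padded with zero bits on the right.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk
        out.append(byte)
    return bytes(out)


def genome_size_gb(base_pairs: float) -> float:
    """Size in GB (10**9 bytes) at 2 bits, i.e. a quarter byte, per base pair."""
    return base_pairs / 4 / 1e9
```

Under these assumptions `genome_size_gb(6e9)` gives 1.5, matching the 1.5 GB on the slide, and `pack_bases("ATCG")` fits in a single byte.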
1.5 x 10^14 GB
All genome sequences in the human body
Whoa! That’s a big number, aren’t you proud?
Burrows-Wheeler transform
A first approach in space optimization
> Shevchenko Sergiy
Burrows-Wheeler transform
Reversible permutation of the characters of a string, used originally for compression
T = a b a a b a $
Sort the rotations of T to get the BWT matrix:
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
BWT(T) = last column = a b b a $ a a
How is it useful for compression?
How is it reversible?
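The construction above can be written directly as a short Python sketch. It is deliberately naive (it materialises every rotation, so it needs quadratic space) and the function names are our own. It relies on `$` sorting before the letters, which holds for Python's default string comparison.

```python
def rotations(t: str) -> list[str]:
    """All cyclic rotations of t."""
    return [t[i:] + t[:i] for i in range(len(t))]


def bwm(t: str) -> list[str]:
    """Burrows-Wheeler matrix: the rotations of t in sorted order."""
    return sorted(rotations(t))


def bwt_via_bwm(t: str) -> str:
    """BWT(T) is the last column of the sorted rotation matrix."""
    return "".join(row[-1] for row in bwm(t))
```

For the running example, `bwt_via_bwm("abaaba$")` returns `"abba$aa"`.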
Burrows-Wheeler transform
Examples
BURROWS.WHEELER.AND.BURROWS.WHEELER$
RRDSS…$NHHEELLWWEEARREERRUUWWBB..OO
TODAY.OR.TOMORROW$
YRDOOTT.MROROW$.OA
Burrows-Wheeler transform
BWT bears a resemblance to the suffix array
BWM(T)          SA(T)
$ a b a a b a   6
a $ a b a a b   5
a a b a $ a b   2
a b a $ a b a   3
a b a a b a $   0
b a $ a b a a   4
b a a b a $ a   1
Sort order is the same whether rows are rotations or suffixes
Burrows-Wheeler transform
In fact, this gives us a new definition/way to construct BWT(T):
BWM(T)          SA(T)   Suffix
$ a b a a b a   6       $
a $ a b a a b   5       a $
a a b a $ a b   2       a a b a $
a b a $ a b a   3       a b a $
a b a a b a $   0       a b a a b a $
b a $ a b a a   4       b a $
b a a b a $ a   1       b a a b a $
“BWT = characters just to the left of the suffixes in the suffix array”
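The suffix-array definition translates into a short sketch. The suffix array is built here by plain sorting of suffixes, which is simple but far too slow for genome-sized inputs; the function names are our own.

```python
def suffix_array(t: str) -> list[int]:
    """Offsets of the suffixes of t in sorted order (naive construction)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def bwt_via_sa(t: str) -> str:
    """BWT(T)[r] is the character just to the left of suffix SA[r].

    For the suffix at offset 0, t[-1] wraps around to the final '$'.
    """
    return "".join(t[i - 1] for i in suffix_array(t))
```

On `"abaaba$"` this gives the suffix array `[6, 5, 2, 3, 0, 4, 1]` and the same BWT as the rotation-based construction.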
Burrows-Wheeler transform
How to reverse the BWT?
T = a b a a b a $
BWT(T) = a b b a $ a a
BWM has a key property called LF mapping
Burrows-Wheeler transform: T-ranking
a0 b0 a1 a2 b1 a3 $
Give each character in T a rank, equal to the number of times the character occurred previously in T. Call this the T-ranking.
Ranks aren’t explicitly stored; they are just for illustration
Now, let's rewrite BWM including ranks...
Burrows-Wheeler transform
BWM with ranking
$ a0 b0 a1 a2 b1 a3
a3 $ a0 b0 a1 a2 b1
a1 a2 b1 a3 $ a0 b0
a2 b1 a3 $ a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $ a0 b0 a1 a2
b0 a1 a2 b1 a3 $ a0
Look at the first and the last columns, called
F and L
And look at just the as: they occur in the same order in F and in L
Same with bs
LF mapping: the ith occurrence of the character c in L and the ith occurrence of c in F correspond to the same occurrence in T (i.e. have the same rank)
However we rank occurrences of c, ranks appear in the same order in F and L
Burrows-Wheeler transform
BWM with T-ranking:
$ a0 b0 a1 a2 b1 a3
a3 $ a0 b0 a1 a2 b1
a1 a2 b1 a3 $ a0 b0
a2 b1 a3 $ a0 b0 a1
a0 b0 a1 a2 b1 a3 $
b1 a3 $ a0 b0 a1 a2
b0 a1 a2 b1 a3 $ a0
F L
Burrows-Wheeler transform
BWM with B-ranking:
$ a3 b1 a1 a2 b0 a0
a0 $ a3 b1 a1 a2 b0
a1 a2 b0 a0 $ a3 b1
a2 b0 a0 $ a3 b1 a1
a3 b1 a1 a2 b0 a0 $
b0 a0 $ a3 b1 a1 a2
b1 a1 a2 b0 a0 $ a3
F L
Ascending order
F now has a very simple structure: a $, a block of as with ascending ranks, a block of bs with ascending ranks
Burrows-Wheeler transform
F   L
$   a0
a0  b0
a1  b1
a2  a1
a3  $
b0  a2
b1  a3
Which BWM row begins with b1?
Skip row starting with $ (1 row)
Skip rows starting with a (4 rows)
Skip row starting with b0 (1 row)
Answer: row 6
Burrows-Wheeler transform
Say T has 300 As, 400 Cs, 250 Gs and 700 Ts, and $ < A < C < G < T
Which BWM row (0-based) begins with G100? (Ranks are B-ranks)
✘ Skip row starting with $ (1 row)
✘ Skip rows starting with A (300 rows)
✘ Skip rows starting with C (400 rows)
✘ Skip first 100 rows starting with G (100 rows)
Answer: 1 + 300 + 400 + 100 = row 801
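The skipping argument is just a sum over character counts, as this small sketch shows. The function name `first_column_row` and the dictionary layout (one count per character, including `$`) are our own choices for illustration.

```python
def first_column_row(counts: dict[str, int], c: str, rank: int) -> int:
    """0-based BWM row whose first character is c with B-rank `rank`.

    counts maps every character of T (including '$') to its number of occurrences.
    """
    if not 0 <= rank < counts[c]:
        raise ValueError("rank out of range for this character")
    # Skip all rows that start with a smaller character...
    smaller = sum(n for ch, n in counts.items() if ch < c)
    # ...then skip the `rank` rows that start with c and have a lower B-rank.
    return smaller + rank
```

With the counts from the example, `first_column_row({"$": 1, "A": 300, "C": 400, "G": 250, "T": 700}, "G", 100)` returns 801.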
Burrows-Wheeler transform
F   L
$   a0
a0  b0
a1  b1
a2  a1
a3  $
b0  a2
b1  a3
Reverse BWT(T) starting at the right-hand side of T and moving left
Start in the first row. F must have $. L contains the character just prior to $: a0
a0: LF mapping says this is the same occurrence of a as the first a in F.
Jump to the row beginning with a0.
L contains the character just prior to a0: b0
Repeat for b0, get a2
Repeat for a2, get a1
Repeat for a1, get b1
Repeat for b1, get a3
Repeat for a3, get $, done
Reverse of characters we visited =
a3 b1 a1 a2 b0 a0 $ = T
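The walk above can be sketched in Python as follows. The helper names are our own, and the B-ranks are computed on the fly from L rather than stored, in line with the remark that ranks are only for illustration.

```python
def rank_bwt(bw: str) -> tuple[list[int], dict[str, int]]:
    """B-rank of every character in L, plus the total count of each character."""
    totals: dict[str, int] = {}
    ranks = []
    for c in bw:
        ranks.append(totals.get(c, 0))
        totals[c] = totals.get(c, 0) + 1
    return ranks, totals


def first_rows(totals: dict[str, int]) -> dict[str, int]:
    """Row of F where the block of each character starts."""
    first, row = {}, 0
    for c in sorted(totals):
        first[c] = row
        row += totals[c]
    return first


def reverse_bwt(bw: str) -> str:
    """Rebuild T from BWT(T) by repeated LF mapping, right to left."""
    ranks, totals = rank_bwt(bw)
    first = first_rows(totals)
    row, t = 0, "$"            # row 0 starts with '$', the last character of T
    while bw[row] != "$":
        c = bw[row]            # character just to the left in T
        t = c + t
        row = first[c] + ranks[row]  # jump to the row of F holding the same occurrence
    return t
```

`reverse_bwt("abba$aa")` visits rows 0, 1, 5, 3, 2, 6, 4, the same order as in the walk above, and returns `"abaaba$"`.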
Burrows-Wheeler transform
1. We’ve seen how BWT is useful for compression:
a. Sorts characters by right-context, making a more compressible string
2. And how it’s reversible:
a. Repeated applications of LF Mapping, recreating T from right to left
How is it used to index?
FM-index
A more space-efficient alternative to suffix array
> Sarto Lino
FM-index
FM-index: an index combining the BWT with a few small auxiliary data structures
Core of index consists of F and L in BWM:
➔ F can be represented very simply (1 integer per alphabet character)
➔ L is compressible
➔ Potentially very space economical!
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
F L
Not stored in index
FM-index: querying
Though BWM is related to the suffix array, we can’t query it the same way
SA(T)  Suffix
6      $
5      a $
2      a a b a $
3      a b a $
0      a b a a b a $
4      b a $
1      b a a b a $
BWM(T)
$ a b a a b a
a $ a b a a b
a a b a $ a b
a b a $ a b a
a b a a b a $
b a $ a b a a
b a a b a $ a
We don’t have the middle columns of BWM(T), only F and L, so binary search is not possible
FM-index: querying
Look for the range of rows of BWM(T) with P as prefix
Do this for P’s shortest suffix, then extend to successively longer suffixes until the range becomes empty or we have exhausted P
F L
P = aba
Easy to find all rows beginning with a, thanks to F’s simple structure
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
FM-index: querying
We have the rows beginning with a, now we seek rows beginning with ba
F L
P = aba
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Look at those rows in L. b₀, b₁ are the bs appearing just to the left.
Use LF mapping. Let the new range delimit those bs.
FM-index: querying
We have the rows beginning with ba, now we seek rows beginning with aba
F L
P = aba
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
a₂, a₃ occur just to the left.
Use LF mapping
FM-index: querying
Now we have the same range, [3, 5[, that we would have got from querying the suffix array
P = aba
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Where are these? [3, 5[
SA(T)  Suffix
6      $
5      a $
2      a a b a $
3      a b a $
0      a b a a b a $
4      b a $
1      b a a b a $
Unlike the suffix array, we do not immediately know where the matches are in T...
Note: when P does not occur in T, the search fails to find the next character in L and the range becomes empty
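The querying steps above amount to a backward search over P. The sketch below is a minimal version that works on L alone; the name `backward_search` is our own, and the occurrence counts are obtained by scanning L with `str.count`, which is the slow step that later slides set out to replace.

```python
def backward_search(bw: str, p: str) -> tuple[int, int]:
    """Half-open range [lo, hi) of BWM rows that have p as a prefix.

    The range is empty (lo == hi) when p does not occur in T.
    """
    # first[c] = row of F where the block of c starts
    first, row = {}, 0
    for c in sorted(set(bw)):
        first[c] = row
        row += bw.count(c)

    lo, hi = 0, len(bw)          # every row matches the empty suffix of p
    for c in reversed(p):        # extend the matched suffix one character to the left
        if c not in first:
            return (0, 0)
        # LF mapping applied to both ends of the range:
        # count the cs in L above lo and above hi (slow scans here)
        lo = first[c] + bw.count(c, 0, lo)
        hi = first[c] + bw.count(c, 0, hi)
        if lo >= hi:
            return (lo, lo)
    return (lo, hi)
```

For `bw = "abba$aa"` and `p = "aba"` the range goes from [1, 5[ to [5, 7[ to [3, 5[, as in the walkthrough; a pattern such as `"bb"` ends with an empty range.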
FM-index: querying
If we scan characters in the last column, that can be very slow, O(m)
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Scan, looking for bs
P = aba
FM-index: issues
1. Scanning for preceding characters is slow
2. Storing ranks takes too much space
3. Need way to find where matches occur in T
FM-index: fast rank calculations
Is there an O(1) way to determine which bs precede the as in our range?
Idea: pre-calculate the number of as and bs in L up to every row
Tally
F  L  a  b
$  a  1  0
a  b  1  1
a  b  1  2
a  a  2  2
a  $  2  2
b  a  3  2
b  a  4  2
O(1) time, but requires m x |Σ| integers
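A full tally table can be built in one pass over L. This is a sketch with our own naming; it leaves `$` out of the table, as the slide does.

```python
def build_tally(bw: str) -> dict[str, list[int]]:
    """tally[c][i] = number of occurrences of c in bw[0..i], inclusive."""
    tally: dict[str, list[int]] = {c: [] for c in sorted(set(bw)) if c != "$"}
    counts = {c: 0 for c in tally}
    for ch in bw:
        if ch in counts:
            counts[ch] += 1
        for c in tally:
            tally[c].append(counts[c])  # one integer per row and per character
    return tally
```

For `"abba$aa"` this reproduces the two columns shown above. The table holds one integer per row for each character, which is where the m x |Σ| cost comes from.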
FM-index: fast rank calculations
Another idea: pre-calculate the number of as and bs up to some rows only, e.g. every 5th row.
Call the pre-calculated rows checkpoints
Tally
F  L  a  b
$  a  1  0
a  b
a  b
a  a
a  $
b  a  3  2
b  a
A lookup at a checkpoint row succeeds as usual. Oops, there is no checkpoint at some rows, but there’s one nearby
To resolve a lookup for a character c in a non-checkpoint row, scan along L until we get to the nearest checkpoint. Use the tally at the checkpoint, adjusted for the number of cs we saw along the way
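A checkpointed tally might look like the sketch below. The class name `CheckpointedRank` and its methods are hypothetical, and for simplicity it always scans upward to the previous checkpoint, whereas the slide allows walking to whichever checkpoint is nearest.

```python
class CheckpointedRank:
    """Tally of L stored only at every `spacing`-th row (row 0 is always a checkpoint)."""

    def __init__(self, bw: str, spacing: int = 5) -> None:
        self.bw = bw
        self.spacing = spacing
        self.checkpoints: dict[str, list[int]] = {c: [] for c in set(bw)}
        counts = {c: 0 for c in self.checkpoints}
        for i, ch in enumerate(bw):
            counts[ch] += 1
            if i % spacing == 0:
                for c in counts:
                    self.checkpoints[c].append(counts[c])

    def count_upto(self, c: str, row: int) -> int:
        """Number of occurrences of c in bw[0..row], inclusive."""
        if c not in self.checkpoints:
            return 0
        seen = 0
        i = row
        # Scan upward along L until we hit a checkpoint row,
        # counting the cs passed on the way.
        while i % self.spacing != 0:
            if self.bw[i] == c:
                seen += 1
            i -= 1
        return self.checkpoints[c][i // self.spacing] + seen
```

With `spacing=5` on `"abba$aa"` the stored rows are 0 and 5, with tallies (1, 0) and (3, 2) as in the table above, and every lookup scans at most `spacing - 1` rows.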
FM-index: fast rank calculations
Another example
Tally
L    a    b
...  ...  ...
b    234  222
b
a
b
b
a
b
a
a    238  226
...  ...  ...
What is my rank? 222 + 2 - 1 = 223
What is my rank? 238 - 1 - 1 = 236
Assuming checkpoints are spaced O(1) distance apart, lookups are O(1)
FM-index: a few problems
SOLVED! At the expense of adding checkpoints to index -> O(m) integers
● With checkpoints scan takes time O(1)
● With checkpoints we greatly reduce number of integers needed for ranks
○ But it is still O(m) space. There’s a literature to improve this space bound
NOT YET RESOLVED: need a way to find where these occurrences are in T
FM-index & Reads Compression
A more space-efficient search for data occurrences & compression idea for reads
> D’Avino Ferdinando
Find Occurrences in T
Need a way to find where the occurrences are in T
A naive method: STORE SUFFIX ARRAY AND LOOK UP
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
SA(T)  Suffix
6      $
5      a $
2      a a b a $
3      a b a $
0      a b a a b a $
4      b a $
1      b a a b a $
str: aba
Offsets: 0, 3
Find Occurrences in T
Another Idea: STORE SOME ENTRIES AND LOOK UP USING FM-INDEX
F L
$ a b a a b a₀
a₀ $ a b a a b₀
a₁ a b a $ a b₁
a₂ b a $ a b a₁
a₃ b a a b a $
b₀ a $ a b a a₂
b₁ a a b a $ a₃
Row  SA sample  Suffix
0    6          $
1    x
2    2          a a b a $
3    x
4    0          a b a a b a $
5    4          b a $
6    x
str: aba
aba INDEX: ? (row 3 is not sampled)
aaba INDEX: 2 (one LF step from row 3 reaches sampled row 2)
aba INDEX: 2 + 1 = 3
IMPORTANT: The entries of SA to store are selected in such a way as to have a constant distance from each other. In our example this distance is 2.
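The lookup with a sampled suffix array can be sketched as follows. The function names are our own; the sample keeps the entries whose text offset is a multiple of `k` (so offset 0 is always kept), and the LF step counts characters by scanning L for brevity rather than using checkpoints.

```python
def build_sa_sample(t: str, k: int = 2) -> dict[int, int]:
    """Map BWM row -> text offset, kept only for offsets that are multiples of k."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])  # naive suffix array
    return {row: off for row, off in enumerate(sa) if off % k == 0}


def locate(bw: str, sa_sample: dict[int, int], row: int) -> int:
    """Text offset of the suffix in `row`, found by LF steps to a sampled row."""
    first, r = {}, 0
    for c in sorted(set(bw)):
        first[c] = r
        r += bw.count(c)

    steps = 0
    while row not in sa_sample:
        c = bw[row]
        row = first[c] + bw.count(c, 0, row)  # LF step: move one position left in T
        steps += 1
    # Each step moved one character to the left, so add the steps back.
    return sa_sample[row] + steps
```

For `"abaaba$"` with `k = 2` the sample is rows 0, 2, 4 and 5, matching the table above, and locating rows 3 and 4 (the range for `aba`) gives offsets 3 and 0.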
FM Index: Small memory footprint
Components of the FM Index
First Column (F): ~ | ∑ | integers
Last Column (L): m characters
SA sample: (m ・ a) integers, where a is the fraction of rows kept
Checkpoints: (m ・ | ∑ | ・ b) integers, where b is the fraction of rows checkpointed
FM Index: Small memory footprint
DNA alphabet (2 bits per nucleotide), T = human genome, a = 1/32, b = 1/128
First Column (F): 4 bytes ・ 4 = 16 bytes
Last Column (L): 2 bits ・ 3 billion chars = 750 MB
SA sample: (3 billion chars ・ 4 bytes) / 32 = ~ 400 MB
Checkpoints: (3 billion chars ・ 4 bytes) / 128 = ~ 100 MB
Total < 1.5 GB
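The arithmetic can be reproduced with a few lines of Python. The function name and parameter names are our own, and the sketch follows the figures above literally: 4-byte integers, 2 bits per character of L, and one integer per kept or checkpointed row. Note that the checkpoint line as written does not include the | ∑ | factor from the component list, so a tally covering all four characters would be about four times larger.

```python
def fm_index_footprint_mb(m: int = 3_000_000_000, sigma: int = 4,
                          int_bytes: int = 4, a: float = 1 / 32,
                          b: float = 1 / 128) -> dict[str, float]:
    """Approximate size of each FM-index component in MB (10**6 bytes)."""
    parts = {
        "F": sigma * int_bytes,         # one integer per alphabet character
        "L": m * 2 / 8,                 # 2 bits per character
        "SA sample": m * int_bytes * a,
        "Checkpoints": m * int_bytes * b,
    }
    return {name: size / 1e6 for name, size in parts.items()}
```

With the default values this gives 750 MB for L, 375 MB for the SA sample and 93.75 MB for the checkpoints, about 1.2 GB in total, consistent with the rounded figures above.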
Compression of BWT Strings
Lots of possible compression schemes will benefit from preprocessing with BWT,
because it tends to group runs of same letters together
Ferragina & Manzini scheme
Move-to-Front Transform
The main idea is to replace every symbol with its index in a list of recently used symbols, then move that symbol to the front of the list
A long sequence of identical symbols becomes one index followed by 0s
Move-to-Front Transform
∑ = {A C G T}, input ...GCGACCT...
Δ = {0 1 2 3}
Input    Output so far   List before the next symbol
GCGACCT  (none)          A C G T
GCGACCT  2               G A C T
GCGACCT  2,2             C G A T
GCGACCT  2,2,1           G C A T
GCGACCT  2,2,1,2         A G C T
GCGACCT  2,2,1,2,2       C A G T
GCGACCT  2,2,1,2,2,0     C A G T
GCGACCT  2,2,1,2,2,0,3   T C A G
FINAL: MTF(GCGACCT) = 2,2,1,2,2,0,3
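A minimal sketch of the transform and its inverse, with function names of our own choosing. The symbol list starts as the alphabet in the order given (A C G T in the example).

```python
def mtf_encode(text: str, alphabet: str) -> list[int]:
    """Replace each symbol by its index in the recently-used list, then move it to the front."""
    symbols = list(alphabet)
    out = []
    for ch in text:
        idx = symbols.index(ch)
        out.append(idx)
        symbols.insert(0, symbols.pop(idx))
    return out


def mtf_decode(codes: list[int], alphabet: str) -> str:
    """Invert mtf_encode by replaying the same list updates."""
    symbols = list(alphabet)
    out = []
    for idx in codes:
        ch = symbols.pop(idx)
        out.append(ch)
        symbols.insert(0, ch)
    return "".join(out)
```

`mtf_encode("GCGACCT", "ACGT")` returns `[2, 2, 1, 2, 2, 0, 3]`, and decoding that list with the same alphabet gives back `"GCGACCT"`.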
RLE Operation & Prefix Code
The RLE operation simply consists of replacing long sequences of 0s with an integer representing the length of the sequence
The output at this point can then be compressed with a prefix code such as Huffman coding, or with arithmetic coding
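One possible way to sketch the run-length step is shown below. The token format `("Z", length)` is purely hypothetical, chosen so that run lengths cannot be confused with ordinary indices, and this version replaces every run of 0s, including runs of length one, which is a simplification of "long sequences".

```python
def rle_zeros(codes: list[int]) -> list:
    """Replace each maximal run of 0s with a ('Z', run_length) token."""
    out: list = []
    i = 0
    while i < len(codes):
        if codes[i] == 0:
            j = i
            while j < len(codes) and codes[j] == 0:
                j += 1                  # extend to the end of the run
            out.append(("Z", j - i))
            i = j
        else:
            out.append(codes[i])
            i += 1
    return out


def unrle_zeros(tokens: list) -> list[int]:
    """Expand ('Z', n) tokens back into n zeros."""
    out: list[int] = []
    for tok in tokens:
        if isinstance(tok, tuple):
            out.extend([0] * tok[1])
        else:
            out.append(tok)
    return out
```

For example, `rle_zeros([2, 0, 0, 0, 1, 0])` gives `[2, ("Z", 3), 1, ("Z", 1)]`, and `unrle_zeros` restores the original list.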
Possible future developments
Try to extend the RLE operation to replace all long sequences of the same char
Try other algorithms for the final phase
THANKS!
Any questions?
You can find us at
✘ f.davino10@studenti.unisa.it
✘ l.sarto1@studenti.unisa.it
✘ s.shevchenko@studenti.unisa.it

Burrows-Wheeler transform for terabases

  • 1. Burrows-Wheeler transform for terabasesBased on work of Jouni Sirén Wellcome Trust Sanger Institute
  • 2. HELLO! We are ... 2 D’Avino Ferdinando Sarto Lino Shevchenko Sergiy
  • 3. Contents ➔ DNA, reads and strings ➔ Burrows Wheeler transform ➔ FM-index ➔ FM-index & Reads Compression 3
  • 4. Introduction DNA, reads and strings > D’Avino Ferdinando
  • 6. DNA A T G C Your genome is sort like a book of recipes, with a separate recipe for each type of molecule in your body DNA is the kind of molecule that encode your genome, the sum of all your genetic information, all your genes
  • 7. DNA These letters stand for different kinds of molecules, different bases. A for adenine, C for cytosine, G for guanine, and T for thymine A DNA molecule is shaped like a double helix, this thing that looks like a twisted ladder. And the rungs of this ladder are made up of pairs of bases We can take the DNA molecule and turn it into a sequence of letters, a string
  • 8. DNA
  • 9. DNA
  • 10. DNA Sequencer Machine 10 A DNA sequencer is a scientific instrument used to automate the DNA sequencing process. Given a sample of DNA, a DNA sequencer is used to determine the order of the four bases: G (guanine), C (cytosine), A (adenine) and T (thymine). This is then reported as a text string, called a read
  • 11. DNA Randomly selected snippets out from the middle of the input DNA, many, many, many, many of them Reads Your Genome
  • 17. 3 Gbp 100 Gbp A sequence machine produces a lot of redundant data 17 Of human genome Of reads
  • 18. 1 Byte = 4 Base pairs [A-T] [T-A] [C-G] [G-C] = [01] [10] [00] [11] 1.5GB 46 Chromosomes 23 from each parent 18
  • 19. 1.5 10 GB Whoa! That’s a big number, aren’t you proud? 19 14 All genome sequences in human body x
  • 20. Burrows-Wheeler transform A first approach in space optimization > Shevchenko Sergiy
  • 21. Burrows-Wheeler transform a b a a b a $ Reversible permutation of characters of a string, used originally for compression $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a Sort BWT matrix a b b a $ a a T BWT(T) How is it useful for compression? How is it reversible?
  • 23. Burrows-Wheeler transform 23 BWT bears a resemblance to suffix array BWM(T) SA(T) $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a Sort order is the same whether rows are rotations or suffixes 6 5 2 3 0 4 1
  • 24. Burrows-Wheeler transform 24 In fact, this gives us a new definition/way to construct BWT(T): BWM(T) SA(T) $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a $ a $ a a b a $ a b a $ a b a a b a $ b a $ b a a b a $ “BWT = characters just to the left of the suffixes in the suffix array” 6 5 2 3 0 4 1
  • 25. Burrows-Wheeler transform 25 a b a a b a $ How to reverse the BWT? $ a b a a b a a $ a b a a b a a b a $ a b a b a $ a b a a b a a b a $ b a $ a b a a b a a b a $ a Sort BWT matrix a b b a $ a a T BWT(T) ? BWM has a key property called LF mapping
  • 26. Burrows-Wheeler transform: T-ranking 26 aO bO a1 a2 b1 a3 $ Give each character in T a rank , equal to # times the character occurred previously in T. Call this the T-ranking. Ranks aren’t explicitly stored; they are just for illustration Now, let's rewrite BWM including ranks...
  • 27. Burrows-Wheeler transform 27 BWM with ranking $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0
  • 28. Burrows-Wheeler transform 28 BWM with ranking $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L Look at the first and the last columns, called F and L
  • 29. Burrows-Wheeler transform 29 BWM with ranking $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L Look at the first and the last columns, called F and L And look at just as as occurs in the same order in F and in L
  • 30. Burrows-Wheeler transform 30 BWM with ranking $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L Same with bs
  • 31. Burrows-Wheeler transform 31 BWM with ranking: $ a0 b0 a1 a2 b1 a3 a3 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L LF mapping: The ith occurrence of the character c in L and the ith occurrence of c in F correspond to the same occurence in T (i.e. have same rank) However we rank occurrences of c, ranks appears in the same order in F and L
  • 32. Burrows-Wheeler transform 32 BWM with T-ranking: $ a0 b0 a1 a2 b1 a3 a0 $ a0 b0 a1 a2 b1 a1 a2 b1 a3 $ a0 b0 a2 b1 a3 $ a0 b0 a1 a0 b0 a1 a2 b1 a3 $ b1 a3 $ a0 b0 a1 a2 b0 a1 a2 b1 a3 $ a0 F L
  • 33. Burrows-Wheeler transform 33 BWM with B-ranking: $ a3 b1 a1 a2 b0 a0 a0 $ a3 b1 a1 a2 b0 a1 a2 b0 a3 $ a3 b1 a2 b0 a0 $ a3 b1 a1 a3 b1 a1 a2 b0 a0 $ b0 a0 $ a3 b1 a1 a2 b1 a1 a2 b0 a0 $ a3 F L Ascending orderF now has a very simple structure: a $, a block of as with ascending ranks, as block of bs with ascending ranks
  • 34. Burrows-Wheeler transform 34 a0 b0 b1 a1 $ a2 a3 F L $ a0 a1 a2 a3 b0 b1 Which BWM row begins with b1? Skip row starting with $(1 row) Skip rows starting with a(4 rows) Skip row starting with b0 (1 row) Answer: row 6row 6
  • 35. Burrows-Wheeler transform 35 Say T has 300 As, 400 Cs, 250 Gs and 700 Ts and $ < A < C < G < T Which BWM row (0-based) begins with G100? (Ranks are B-ranks) ✘ Skip row starting with $ ✘ Skip 300 A ✘ Skip 400 C ✘ Skip first 100 rows starting with G (100 rows) Answer: 1 + 300 + 400 + 100 = row 801
  • 36. Burrows-Wheeler transform 36 a0 b0 b1 a1 $ a2 a3 F L $ a0 a1 a2 a3 b0 b1 Reverse BWT(T) starting at right-hand-side of T and moving left Start in first row. F must have $. L contains character just prior to $: a0 a0 : LF mapping says this is the same occurence of a in as first a in F. Jump to row beginning with a0. L contains character just prior to a0: b0 Repeat for b0, get a2 Repeat for a2, get a1 Repeat for a1, get b1 Repeat for b1, get a3 Repeat for a3, get $, done Reverse of characters we visited = a3 b1 a1 a2 b0 a0 $ = T
  • 37. Burrows-Wheeler transform
    1. We've seen how the BWT is useful for compression:
       a. It sorts characters by right-context, making a more compressible string
    2. And how it's reversible:
       a. Repeated applications of the LF mapping recreate T from right to left
    How is it used to index?
  • 38. FM-index: a more space-efficient alternative to the suffix array > Sarto Lino
  • 39. FM-index
    FM-index: an index combining the BWT with a few small auxiliary data structures.
    The core of the index consists of F and L from the BWM:
    ➔ F can be represented very simply (1 integer per alphabet character)
    ➔ L is compressible
    ➔ Potentially very space-economical!
      F           L
      $ a b a a b a
      a $ a b a a b
      a a b a $ a b
      a b a $ a b a
      a b a a b a $
      b a $ a b a a
      b a a b a $ a
    The columns between F and L are not stored in the index.
  • 40. FM-index: querying
    Though the BWM is related to the suffix array, we can't query it the same way.
      SA  Suffix         BWM row
      6   $              $ a b a a b a
      5   a $            a $ a b a a b
      2   a a b a $      a a b a $ a b
      3   a b a $        a b a $ a b a
      0   a b a a b a $  a b a a b a $
      4   b a $          b a $ a b a a
      1   b a a b a $    b a a b a $ a
    We don't have the middle columns of the BWM (only F and L), so binary search is not possible.
  • 41. FM-index: querying
    Look for the range of rows of BWM(T) that have P as a prefix.
    Do this for P's shortest suffix, then extend to successively longer suffixes until the range becomes empty or we have exhausted P.
    P = aba
      Row  F           L
      0    $ a b a a b a₀
      1    a₀ $ a b a a b₀
      2    a₁ a b a $ a b₁
      3    a₂ b a $ a b a₁
      4    a₃ b a a b a $
      5    b₀ a $ a b a a₂
      6    b₁ a a b a $ a₃
    It is easy to find all rows beginning with a, thanks to F's simple structure.
  • 42. FM-index: querying
    P = aba
    We have the rows beginning with a; now we seek the rows beginning with ba.
      Row  F           L
      1    a₀ $ a b a a b₀
      2    a₁ a b a $ a b₁
      3    a₂ b a $ a b a₁
      4    a₃ b a a b a $
    Look at those rows in L: b₀ and b₁ are the bs appearing just to the left.
    Use the LF mapping. Let the new range delimit those bs (the rows whose F entries are b₀ and b₁).
  • 43. FM-index: querying
    P = aba
    We have the rows beginning with ba; now we seek the rows beginning with aba.
      Row  F           L
      5    b₀ a $ a b a a₂
      6    b₁ a a b a $ a₃
    a₂ and a₃ occur just to the left. Use the LF mapping (the new range is the rows whose F entries are a₂ and a₃).
  • 44. FM-index: querying
    P = aba
    Now we have the same range, [3, 5), that we would have got from querying the suffix array:
      Row  F           L   SA  Suffix
      3    a₂ b a $ a b a₁  3   a b a $
      4    a₃ b a a b a $   0   a b a a b a $
    Where are these? Unlike with the suffix array, we do not immediately know where the matches are in T, because the SA column is not part of the FM-index.
    Note: when P does not occur in T, the search fails to find the next character of P in L within the current range.
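The query steps on slides 41 to 44 can be sketched as a backward search over L. This is a minimal illustration, not the presenters' implementation: the helper `occ` counts characters by scanning L (the slow step that the following slides speed up), and the names `fm_range`, `first` and `occ` are assumptions made here.

```python
def fm_range(L, p):
    """Half-open range [lo, hi) of BWM rows having p as a prefix; lo == hi if no match."""
    counts = {}
    for c in L:
        counts[c] = counts.get(c, 0) + 1
    first, total = {}, 0
    for c in sorted(counts):         # first[c] = row where c's block starts in F
        first[c] = total
        total += counts[c]

    def occ(c, i):
        """Number of occurrences of c in L[0:i] (naive O(m) scan)."""
        return L[:i].count(c)

    lo, hi = 0, len(L)               # start with every row
    for c in reversed(p):            # shortest suffix of p first
        if c not in first:
            return (0, 0)
        lo = first[c] + occ(c, lo)   # LF-map both ends of the range
        hi = first[c] + occ(c, hi)
        if lo >= hi:                 # next character not found in L within the range
            return (lo, lo)
    return (lo, hi)

print(fm_range("abba$aa", "aba"))  # (3, 5)
print(fm_range("abba$aa", "bb"))   # (7, 7): empty range, no match
```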
  • 45. FM-index: querying
    P = aba
    L = a₀ b₀ b₁ a₁ $ a₂ a₃
    If we scan the characters in the last column, looking for bs, that can be very slow: O(m).
  • 46. FM-index: issues
    1. Scanning for preceding characters is slow
    2. Storing ranks takes too much space
    3. We need a way to find where matches occur in T
  • 47. FM-index: fast rank calculations
    Is there an O(1) way to determine which bs precede the as in our range?
    Idea: pre-calculate the number of as and bs in L up to every row.
               Tally
      F   L    a   b
      $   a    1   0
      a   b    1   1
      a   b    1   2
      a   a    2   2
      a   $    2   2
      b   a    3   2
      b   a    4   2
    O(1) time, but requires m × |Σ| integers.
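A minimal sketch of the full tally table above, assuming one Python list per tallied character (the name `build_tally` is illustrative). The B-rank of the character in row i of L is then its tally at row i minus one.

```python
def build_tally(L, alphabet):
    """tally[c][i] = number of occurrences of c in L[0..i], inclusive."""
    tally = {c: [] for c in alphabet}
    running = {c: 0 for c in alphabet}
    for ch in L:
        if ch in running:            # '$' is not tallied
            running[ch] += 1
        for c in alphabet:
            tally[c].append(running[c])
    return tally

tally = build_tally("abba$aa", "ab")
print(tally["a"])  # [1, 1, 1, 2, 2, 3, 4]
print(tally["b"])  # [0, 1, 2, 2, 2, 2, 2]
```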
  • 48. FM-index: fast rank calculations
    Another idea: pre-calculate the number of as and bs only up to some rows, e.g. every 5th row. Call the pre-calculated rows checkpoints.
               Tally
      F   L    a   b
      $   a    1   0   (checkpoint)
      a   b
      a   b
      a   a
      a   $
      b   a    3   2   (checkpoint)
      b   a
    At a checkpoint row, the lookup succeeds as usual. At other rows there is no tally, but there is a checkpoint nearby.
    To resolve a lookup for a character c in a non-checkpoint row, scan along L until we get to the nearest checkpoint. Use the tally at the checkpoint, adjusted for the number of cs we saw along the way.
  • 49. FM-index: fast rank calculations
    Another example:
           Tally
      L    a    b
      ...  ...  ...
      b    234  222   (checkpoint)
      b
      a
      b               What is my rank? 222 + 2 - 1 = 223
      b
      a
      b
      a               What is my rank? 238 - 1 - 1 = 236
      a    238  226   (checkpoint)
      ...  ...  ...
    Assuming checkpoints are spaced O(1) distance apart, lookups are O(1).
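The checkpoint lookup can be sketched as follows. As a simplification, this version always scans back to the checkpoint at or above the row, rather than picking the nearest one in either direction as the slides do; the function names and the `step` parameter are assumptions for illustration.

```python
def build_checkpoints(L, alphabet, step):
    """Store the tally only at rows 0, step, 2*step, ..."""
    cps = {c: [] for c in alphabet}
    running = {c: 0 for c in alphabet}
    for i, ch in enumerate(L):
        if ch in running:
            running[ch] += 1
        if i % step == 0:
            for c in alphabet:
                cps[c].append(running[c])
    return cps

def count_through(L, cps, step, c, i):
    """Number of occurrences of c in L[0..i], using the checkpoint at or above row i."""
    cp = i // step
    n = cps[c][cp]
    for j in range(cp * step + 1, i + 1):   # at most step - 1 rows are scanned
        if L[j] == c:
            n += 1
    return n

L = "abba$aa"
cps = build_checkpoints(L, "ab", step=5)
print(cps)                                 # {'a': [1, 3], 'b': [0, 2]}
print(count_through(L, cps, 5, "a", 3))    # 2, the same value as the full tally
```

With a fixed `step`, each lookup scans a bounded number of rows, which is the O(1) claim on the slide.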
  • 50. FM-index: a few problems
    SOLVED, at the expense of adding checkpoints (O(m) integers) to the index:
    ● With checkpoints, a scan takes O(1) time
    ● With checkpoints, we greatly reduce the number of integers needed for ranks
      ○ But it is still O(m) space. There is a literature on improving this space bound
    NOT YET RESOLVED: we need a way to find where these occurrences are in T
  • 51. FM-index & Reads Compression: a more space-efficient search for occurrences, and a compression idea for reads > D’Avino Ferdinando
  • 52. Find Occurrences in T
    We need a way to find where the occurrences are in T.
    A naive method: STORE THE SUFFIX ARRAY AND LOOK UP
      Row  F           L   SA  Suffix
      0    $ a b a a b a₀  6   $
      1    a₀ $ a b a a b₀  5   a $
      2    a₁ a b a $ a b₁  2   a a b a $
      3    a₂ b a $ a b a₁  3   a b a $
      4    a₃ b a a b a $   0   a b a a b a $
      5    b₀ a $ a b a a₂  4   b a $
      6    b₁ a a b a $ a₃  1   b a a b a $
    str: aba
    Offsets: 0, 3
  • 53. Find Occurrences in T
    Another idea: STORE SOME ENTRIES AND LOOK UP USING THE FM-INDEX
      Row  F           L   SA
      0    $ a b a a b a₀  6
      1    a₀ $ a b a a b₀  x
      2    a₁ a b a $ a b₁  2
      3    a₂ b a $ a b a₁  x
      4    a₃ b a a b a $   0
      5    b₀ a $ a b a a₂  4
      6    b₁ a a b a $ a₃  x
    str: aba
    The row for aba has no stored entry (INDEX: ?). One LF step leads to the row for aaba, which is stored (INDEX: 2), so aba is at INDEX: 2 + 1 = 3.
    IMPORTANT: the entries of SA to store are selected so that they are at a constant distance from each other in T. In our example this distance is 2.
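The lookup with a sampled suffix array can be sketched like this: from a row with no stored entry, follow the LF mapping until a stored row is reached, then add the number of steps taken. The dictionary `sampled_sa` (row to offset) and the name `locate` are assumptions for illustration; the sketch also assumes offset 0 is always among the stored entries, so the walk never steps past the start of T.

```python
def locate(L, sampled_sa, row):
    """Offset in T of the suffix at BWM row `row`, using a sampled suffix array."""
    counts, ranks = {}, []
    for c in L:                          # B-ranks of L, as in the reversal of the BWT
        ranks.append(counts.get(c, 0))
        counts[c] = counts.get(c, 0) + 1
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    steps = 0
    while row not in sampled_sa:         # walk left in T until a stored entry is hit
        c = L[row]
        row = first[c] + ranks[row]      # LF mapping
        steps += 1
    return sampled_sa[row] + steps

L = "abba$aa"
sampled_sa = {0: 6, 2: 2, 4: 0, 5: 4}    # rows marked x on the slide are absent
print([locate(L, sampled_sa, r) for r in (3, 4)])  # [3, 0]: the offsets of "aba"
```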
  • 54. FM Index: Small memory footprint
    Components of the FM Index:
    First column (F): ~|Σ| integers
    Last column (L): m characters
    SA sample: m・a integers, where a is the fraction of rows kept
    Checkpoints: m・|Σ|・b integers, where b is the fraction of rows checkpointed
  • 55. FM Index: Small memory footprint
    DNA alphabet (2 bits per nucleotide), T = human genome, a = 1/32, b = 1/128
    First column (F): 4 bytes・4 = 16 bytes
    Last column (L): 2 bits・3 billion chars = 750 MB
    SA sample: (3 billion chars・4 bytes)/32 = ~400 MB
    Checkpoints: (3 billion chars・4 bytes)/128 = ~100 MB
    Total < 1.5 GB
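The arithmetic on this slide can be reproduced with a short sketch. It follows the slide's own figures, which count one 4-byte integer per checkpointed row; if a tally is stored for each of the 4 characters at every checkpoint, as the component formula m・|Σ|・b suggests, the checkpoint term would be about 375 MB and the total roughly 1.5 GB. The constant names are illustrative.

```python
M = 3_000_000_000        # characters in T (human genome, approximate)
SIGMA = 4                # DNA alphabet size
INT_BYTES = 4            # bytes per stored integer
A_DEN, B_DEN = 32, 128   # a = 1/32 of SA rows kept, b = 1/128 of rows checkpointed

f_bytes = SIGMA * INT_BYTES               # first column: one integer per character
l_bytes = M * 2 // 8                      # last column: 2 bits per character
sa_bytes = M // A_DEN * INT_BYTES         # sampled suffix array
cp_bytes = M // B_DEN * INT_BYTES         # checkpoints, as computed on the slide

total = f_bytes + l_bytes + sa_bytes + cp_bytes
for name, b in [("F", f_bytes), ("L", l_bytes), ("SA sample", sa_bytes),
                ("Checkpoints", cp_bytes), ("Total", total)]:
    print(f"{name}: {b / 1e6:.2f} MB")
```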
  • 56. Compression of BWT Strings
    Many compression schemes benefit from preprocessing with the BWT, because it tends to group runs of the same letter together.
    Ferragina & Manzini scheme
  • 57. Move to First Transform
    The main idea is to replace every symbol with its index in the stack of “recently used symbols”.
    A long run of identical symbols becomes a run of 0s (every symbol of the run after the first is encoded as 0).
  • 58. Move to First Transform
    Σ = {A C G T}, input ...GCGACCT..., output alphabet Δ = {0 1 2 3}
      Step  Symbol  Stack before  Output so far
      1     G       A C G T       2
      2     C       G A C T       2,2
      3     G       C G A T       2,2,1
      4     A       G C A T       2,2,1,2
      5     C       A G C T       2,2,1,2,2
      6     C       C A G T       2,2,1,2,2,0
      7     T       C A G T       2,2,1,2,2,0,3
    FINAL: MTF(GCGACCT) = 2,2,1,2,2,0,3 (stack after the last step: T C A G)
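The worked example above can be reproduced with a minimal sketch of the transform, assuming the stack starts as the alphabet in sorted order (the name `move_to_front` is illustrative):

```python
def move_to_front(text, alphabet):
    """Replace each symbol by its index in a 'recently used' stack, then move it to the front."""
    stack = list(alphabet)
    out = []
    for ch in text:
        i = stack.index(ch)          # position of the symbol in the current stack
        out.append(i)
        stack.insert(0, stack.pop(i))
    return out, stack

codes, stack = move_to_front("GCGACCT", "ACGT")
print(codes)   # [2, 2, 1, 2, 2, 0, 3]
print(stack)   # ['T', 'C', 'A', 'G']
```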
  • 59. RLE Operation & Prefix Code
    The RLE operation simply consists of replacing each long sequence of 0s with an integer representing the length of the sequence.
    The output at this point can be compressed further by an entropy coder, such as Huffman coding (a prefix code) or arithmetic coding.
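A minimal sketch of the RLE step, under one assumption that the slide does not specify: a run of 0s is written as a tagged pair `("Z", length)` so that it cannot be confused with an ordinary MTF code when decoding. A real encoder would use a dedicated symbol or bit-level code instead.

```python
def rle_zeros(codes):
    """Replace each run of 0s by ("Z", run_length); other codes pass through unchanged."""
    out, run = [], 0
    for x in codes:
        if x == 0:
            run += 1
            continue
        if run:
            out.append(("Z", run))
            run = 0
        out.append(x)
    if run:                              # flush a run that ends the input
        out.append(("Z", run))
    return out

def unrle_zeros(encoded):
    """Inverse of rle_zeros."""
    out = []
    for item in encoded:
        if isinstance(item, tuple):
            out.extend([0] * item[1])
        else:
            out.append(item)
    return out

encoded = rle_zeros([2, 0, 0, 0, 1, 0])
print(encoded)                # [2, ('Z', 3), 1, ('Z', 1)]
print(unrle_zeros(encoded))   # [2, 0, 0, 0, 1, 0]
```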
  • 61. Possible future developments
    Try to extend the RLE operation to replace all long sequences of the same character
    Try other algorithms for the final phase
  • 62. THANKS! Any questions?
    You can find us at
    ✘ f.davino10@studenti.unisa.it
    ✘ l.sarto1@studenti.unisa.it
    ✘ s.shevchenko@studenti.unisa.it