Comparative Genomics and de Bruijn graphs

Comparative Genomics and the
de Bruijn graphs
Ilia Minkin
Pennsylvania State University
16th September 2016

What is comparative genomics?
The collection of all research activities that derive
biological insights by comparing genomic features
(Comparative Genomics, Xuhua Xia).
Why do it?
Learn evolution
Learn function

Learn Function
A genomic sequence itself does not show its functions
How to find function?
Compare with sequences of known function
Conserved sequences are likely to be important
How to compare genomes?

What is an Alignment?
Organisms inherit genomes, but with errors:
[Figure: the ancestor genome and its two descendants, Genome A and Genome B]
Which characters did A and B inherit from their ancestor?

What is an Alignment?
Alignments are written down as a table:
ACTG-TGA
ACTACTGA
Blue letters are matches; yellow are mismatches;
dashes are indels.
This is a global alignment.

The Global Alignment
ACTG-TGA
ACTACTGA
For two strings A and B :
Place them under each other
Insert dashes into A and B so that |A| = |B|
Penalize dashes and mismatches
Which alignment gives the least penalty?
Complexity: O(|A||B|)
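
The least-penalty formulation above can be sketched with the classic dynamic-programming table. This is a minimal illustration, not the speaker's code: the function name `global_align` and the unit penalties (1 per mismatch, 1 per dash) are assumptions chosen for the example.

```python
def global_align(a, b, mismatch=1, gap=1):
    """Return (penalty, aligned_a, aligned_b) with the least total penalty.

    The unit penalties are illustrative assumptions, not values from the slides.
    """
    n, m = len(a), len(b)
    # cost[i][j] = least penalty for aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            cost[i][j] = min(sub, cost[i - 1][j] + gap, cost[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = None
        if i > 0 and j > 0:
            sub = cost[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
        if sub is not None and cost[i][j] == sub:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))
```

For the two strings on the slide, ACTGTGA and ACTACTGA, these penalties give a least penalty of 2 (one mismatch and one dash). The table has (|A| + 1)(|B| + 1) cells, which is where the O(|A||B|) bound comes from.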

The Local Alignment
For large sequences, global alignment does not
work:
GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA
Apart from indels and mismatches, there can
be rearrangements
Rearrangements change the order of whole
blocks
Similar subsequences can be interleaved with
unrelated sequence

The Local Alignment
GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA
The problem: for A and B, find their most similar
subsequences and their alignment:
ACTG-TGA
ACTACTGA
Complexity: O(|A||B|)
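
A common way to solve this problem is the Smith-Waterman recurrence, sketched below. The slides do not name a specific method or scoring scheme, so the function name `local_align` and the scores (+1 match, -1 mismatch, -1 dash) are assumptions made for illustration.

```python
def local_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best-scoring local alignment.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score of a local alignment ending at a[i-1] and b[j-1];
    # it is clamped at 0 so that an alignment may start anywhere.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    row_a, row_b = [], []
    while score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))
```

Under these scores the alignment shown on the slide (six matches, one mismatch, one dash) is worth 4, so the function returns a score of at least 4 on the two example sequences.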

An Example
We can generalize to many genomes:
GAACTGTGATTATGCTCA
ATTTGGGACTACTGAGTA
ATCTTGAGATAGCTGAAA
Alignments:
ACTG-TGA
ACTACTGA
A-TGCTCA

Multiple Local Alignment
Issues:
Some subsequences can be present in some
genomes and absent in others
Genomes can have duplications
Multiple sequence alignment is NP-hard
→ We need some heuristics

Another Approach
Another way to find common subsequences is to
build a graph from the genomes
In such a graph, homologous subsequences
collapse into non-branching paths, while unique ones
form disjoint paths

The Linear Representation
Two genomes:

Solution: a Graph Representation
What we want to see:

Why de Bruijn graph?
A simple object.
Demonstrated utility in:
Assembly
Read mapping
Synteny identification

The de Bruijn Graph
k = 2
TGACGTC TGACTTC
TG → GA → AC → CG → GT → TC
TG → GA → AC → CT → TT → TC

The de Bruijn Graph
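TG → GA → AC → CG → GT → TC
TG → GA → AC → CT → TT → TC
Merging identical k-mers into single vertices gives one graph:
TG → GA → AC
AC → CG → GT → TC
AC → CT → TT → TC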

The de Bruijn graph
In the de Bruijn graph, identical substrings of length
at least k + 1 are collapsed into non-branching paths
We can use this to find homologous blocks.
We developed a tool, Sibelia, that finds such blocks
in many bacterial genomes and handles repeats.
But we can do more.
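
The construction used in the running example can be sketched as follows: vertices are k-mers and each (k + 1)-mer of an input string becomes an edge. The function name `de_bruijn` is hypothetical, and the sketch considers a single strand only (no reverse complements), which is a simplifying assumption.

```python
def de_bruijn(strings, k):
    """Map each k-mer (vertex) to the set of k-mers that follow it in some input."""
    successors = {}
    for s in strings:
        for i in range(len(s) - k + 1):
            # Register every k-mer as a vertex, including the last one of a string.
            successors.setdefault(s[i:i + k], set())
        for i in range(len(s) - k):
            # The (k + 1)-mer s[i:i+k+1] is an edge between two consecutive k-mers.
            successors[s[i:i + k]].add(s[i + 1:i + k + 1])
    return successors


graph = de_bruijn(["TGACGTC", "TGACTTC"], k=2)
```

For the two strings from the slides, `graph` has 8 vertices, and `graph["AC"]` contains both `CG` and `CT`: the shared prefix TGAC collapses into one path that branches at AC.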

Alignment to a Graph
It is common to have an unassembled genome
Reads are then aligned to a very similar reference
genome:

Issues:
More than one reference?
Repeats within genomes?
Solution: align reads to a graph!

In the future, genome graphs will encode information
about a population
Aligning reads to a graph has many advantages:
Efficient alignment to many genomes
Reusing information about variants
Handling of repeats
The de Bruijn graph is a feasible model for a graph
reference.
Issue: the graph can be too large.

Compaction
After compaction:
TG → AC (TGAC)
AC → TC (ACGTC)
AC → TC (ACTTC)

The Challenge
Construct the compacted graph from many large
genomes, bypassing the ordinary graph traversal.
Earlier work, based on suffix arrays/trees (Sibelia,
SplitMEM), handled 60 E. coli genomes.
A recent advance: 7 humans in 15 hours using 100
GB of RAM, with a BWT-based algorithm (Baier
et al., 2015; Beller et al., 2014).

Junctions
A vertex v is a junction if:
v has ≥ 2 distinct outgoing or incoming edges, or
v is the first or the last k-mer of an input string
Facts:
Junctions = vertices of the compacted graph
Compaction = finding positions of junctions

Observations
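Compacted graph: TG → AC (TGAC), AC → TC (ACGTC), AC → TC (ACTTC)
k-mers of the input string TGACGTC, in order: TG GA AC CG GT TC
Keeping only the junctions: TG → AC → TC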

The Observation
The observation only works when we have complete
genomes.
Once we know junctions, construction of the edges is
simple.
We can simply traverse input strings and record
junctions in the order they appear.
How to identify junctions?
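
Assuming the set of junctions is already known, the traversal described above can be sketched as below. The function name `compacted_edges` is hypothetical; each edge of the compacted graph is reported as the substring spanning two consecutive junctions.

```python
def compacted_edges(strings, k, junctions):
    """List the edge labels of the compacted graph, in order of first appearance."""
    edges = []
    for s in strings:
        start = None  # position of the most recent junction seen in this string
        for i in range(len(s) - k + 1):
            if s[i:i + k] in junctions:
                if start is not None:
                    # The substring from the previous junction through this one
                    # is a single non-branching path, i.e. one compacted edge.
                    edges.append(s[start:i + k])
                start = i
    # The same edge can be traversed by several strings; keep one copy of each.
    return list(dict.fromkeys(edges))


edges = compacted_edges(["TGACGTC", "TGACTTC"], k=2, junctions={"TG", "AC", "TC"})
```

For the running example this gives the three edges TGAC, ACGTC and ACTTC. Each input string is scanned once, so no traversal of the uncompacted graph is needed.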

The Naive Algorithm
A naive way:
Store all (k + 1)-mers (edges) in a hash table
Consider each vertex one by one
Query all possible edges from the table
If more than 1 outgoing or incoming edge is found, mark the vertex as a junction
Problem: the hash table can be too large.
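
A minimal sketch of the naive approach, using a Python set as the hash table. The function name `naive_junctions` and the single-strand DNA alphabet are assumptions made for the example.

```python
def naive_junctions(strings, k, alphabet="ACGT"):
    """Find junctions by storing every (k + 1)-mer in an exact hash table."""
    # The hash table of all edges; this is the part that grows too large.
    edges = {s[i:i + k + 1] for s in strings for i in range(len(s) - k)}
    junctions = set()
    for s in strings:
        if len(s) < k:
            continue
        # The first and last k-mers of an input string are junctions by definition.
        junctions.add(s[:k])
        junctions.add(s[-k:])
        for i in range(len(s) - k + 1):
            v = s[i:i + k]
            # Query all possible edges leaving and entering v.
            outgoing = sum((v + c) in edges for c in alphabet)
            incoming = sum((c + v) in edges for c in alphabet)
            if outgoing > 1 or incoming > 1:
                junctions.add(v)
    return junctions
```

On the strings TGACGTC and TGACTTC with k = 2 this marks TG, AC and TC. Every distinct (k + 1)-mer of the input is held in memory at once, which is the problem named on the slide.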

An Example
Hash table = { GA → AC }
Possible edges out of GA: GA → AA, GA → AG, GA → AC, GA → AT
Only GA → AC is found in the table.

What is a Bloom filter?
A probabilistic data structure representing a set
Properties:
Occupies fixed space
May generate false positives on queries
False positive rate is low
Example: Bloom filter = { GA → AC }
Is GA → AC in the set? Yes.
Is GA → AT in the set? Probably no, but a false "yes" is possible.
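
A small Bloom filter can be sketched as a fixed-size bit array with several hash functions. The class below is illustrative only: deriving the hash functions from seeded SHA-256 digests, and the parameter names, are assumptions made for this example and not a description of any particular tool.

```python
import hashlib


class BloomFilter:
    """A fixed-size set approximation: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # space is fixed up front

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("GAC")  # the edge GA → AC, written as a 3-mer
```

After this, `"GAC" in bf` is always true. A query for an edge that was never added, such as `"GAT"`, usually returns false but may return true; how often depends on the number of bits, the number of hash functions and how many items have been added.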

An Example
Bloom Filter = { GA → AC, GA → AT }
Possible edges out of GA: GA → AA, GA → AG, GA → AC, GA → AT
The purple edge (GA → AT) is a false positive.

The Two Pass Algorithm
How to eliminate false positives?
Two-pass algorithm:
1. Use the Bloom filter to identify junction
candidates
2. Use the hash table, but store only edges that
touch candidates
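
The two passes can be sketched as follows. This is a simplified single-threaded, single-strand illustration under stated assumptions, not the TwoPaCo implementation: the function name `two_pass_junctions`, the SHA-256 based Bloom filter and the default sizes are all made up for the example.

```python
import hashlib


def two_pass_junctions(strings, k, num_bits=1 << 16, num_hashes=3, alphabet="ACGT"):
    """Find junctions exactly, using a Bloom filter to shrink the hash table."""
    bits = bytearray((num_bits + 7) // 8)

    def positions(edge):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{edge}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % num_bits

    def maybe_present(edge):
        return all(bits[p // 8] & (1 << (p % 8)) for p in positions(edge))

    def is_branching(v, has_edge):
        outgoing = sum(has_edge(v + c) for c in alphabet)
        incoming = sum(has_edge(c + v) for c in alphabet)
        return outgoing > 1 or incoming > 1

    def kmers(s):
        return (s[i:i + k] for i in range(len(s) - k + 1))

    def edges(s):
        return (s[i:i + k + 1] for i in range(len(s) - k))

    # The first and last k-mers of each string are junctions by definition.
    ends = {e for s in strings if len(s) >= k for e in (s[:k], s[-k:])}

    # Pass 1: put every edge into the Bloom filter and collect candidates.
    # False positives can add spurious candidates but never hide a real junction.
    for s in strings:
        for e in edges(s):
            for p in positions(e):
                bits[p // 8] |= 1 << (p % 8)
    candidates = ends | {
        v for s in strings for v in kmers(s) if is_branching(v, maybe_present)
    }

    # Pass 2: an exact hash table, restricted to edges that touch a candidate.
    exact = {
        e for s in strings for e in edges(s)
        if e[:k] in candidates or e[1:] in candidates
    }
    return ends | {v for v in candidates if is_branching(v, exact.__contains__)}
```

Because the second pass checks every candidate against exact edges, the result does not depend on the Bloom filter size: a smaller filter only produces more candidates and therefore a larger hash table in pass 2. For TGACGTC and TGACTTC with k = 2 the junctions are TG, AC and TC.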

An Example: the First Step
Edges stored in the Bloom filter; purple ones are
false positives:
[Figure: graph over the vertices TG, GA, AC, CG, GT, CT, TT, TC, AT, CC]
Junction candidates: GA, AC

An Example: the Second Step
Edges stored in the hash table. We kept only edges
touching junction candidates:
Junction: AC

Results
Datasets:
7 humans: 5 versions of the reference +
2 haplotypes of NA12878 from 1000 Genomes
93 simulated humans (FIGG)
8 primates available in the UCSC Genome Browser

Results
Running time in minutes, with memory usage in GB in parentheses.
# genomes       BWT-based       TwoPaCo         TwoPaCo
                1 thread        1 thread        15 threads
Humans
7, k = 25       867 (100.30)    436 (4.40)      63 (4.84)
7, k = 100      807 (46.02)     317 (8.42)      57 (8.75)
43+7, k = 25    -               -               705 (69.77)
43+7, k = 100   -               -               927 (70.21)
93+7, k = 25    -               -               1383 (77.42)
Primates
8, k = 25       -               914 (34.36)     111 (34.36)
8, k = 100      -               756 (56.06)     101 (61.68)

Conclusion and Future Work
Advantages of the algorithm:
Fast
Small memory footprint
Can handle large inputs
Drawbacks:
Less applicable for large k
Take-home message: it is easy to construct the
compacted de Bruijn graph for complete genomes.

Can potentially facilitate:
Visualization
Synteny mining (Sibelia)
Structural variations analysis
...

Acknowledgments
Personal:
Daniel Lemire
GFA format working group
Funding, NSF awards:
DBI-1356529
CCF-1439057
IIS-1453527
IIS-1421908

Thank you for your attention!
Twitter: @IliaMinkin

Input Size vs. Performance

Splitting
Table 1: The minimal number of rounds it takes to compress
the graph without exceeding a given memory threshold.
Memory threshold   Used memory   Bloom filter size   Running time   Rounds
10                 8.62          8.59                259            1
8                  6.73          4.29                434            3
6                  5.98          4.29                539            4
4                  3.51          2.14                665            6
