Comparative Genomics and the
de Bruijn graphs
Ilia Minkin
Pennsylvania State University
16th September 2016
1 / 43
What is comparative genomics?
The collection of all research activities that derive
biological insights by comparing genomic features. [1]
Why do it?
Learn evolution
Learn function
[1] Comparative Genomics, Xuhua Xia
2 / 43
Learn Evolution
3 / 43
Learn Function
A genomic sequence by itself does not reveal its function
How to find function?
Compare with sequences of known function
Conserved sequences are likely to be important
How to compare genomes?
4 / 43
What is an Alignment?
Organisms inherit genomes but with errors:
The Ancestor
Genome A Genome B
Which characters did A and B inherit from their ancestor?
5 / 43
What is an Alignment?
Alignments are written down as a table:
ACTG-TGA
ACTACTGA
Blue letters are matches; yellow are mismatches;
dashes are indels.
This is a global alignment.
6 / 43
The Global Alignment
ACTG-TGA
ACTACTGA
For two strings A and B:
Place them under each other
Insert dashes into A and B so that |A| = |B|
Penalize dashes and mismatches
Which alignment gives the least penalty?
Complexity: O(|A| |B|)
7 / 43
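The steps on this slide can be sketched as a dynamic program (the classic Needleman-Wunsch scheme). The unit penalties and the name `global_alignment` are assumptions made for illustration; the slide does not fix a scoring scheme.

```python
def global_alignment(a, b, mismatch=1, gap=1):
    """Return (penalty, aligned_a, aligned_b) for a least-penalty alignment."""
    n, m = len(a), len(b)
    # cost[i][j] = least penalty for aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match or mismatch
                             cost[i - 1][j] + gap,      # dash in b
                             cost[i][j - 1] + gap)      # dash in a

    # Walk back from the corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


penalty, row_a, row_b = global_alignment("ACTGTGA", "ACTACTGA")
print(penalty)
print(row_a)
print(row_b)
```

The table has (|A| + 1)(|B| + 1) cells and each is filled in constant time, which is where the O(|A| |B|) bound comes from. Several alignments can share the least penalty, so the printed rows may differ from the ones drawn on the slide.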
The Local Alignment
For large sequences the global alignment does not
work:
GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA
Apart from indels and mismatches there can
be rearrangements
Rearrangements change the order of whole
blocks
Similar subsequences can be interleaved with
unrelated sequence
8 / 43
The Local Alignment
GAACTGTGATTAGGACGT
ATTTGGGACTACTGAGTA
The problem: for A and B, find their most similar
subsequences and their alignment:
ACTG-TGA
ACTACTGA
Complexity: O(|A| |B|)
9 / 43
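A minimal sketch of local alignment in the Smith-Waterman style, which keeps the same table as the global case but lets an alignment start and end anywhere. The scores (+1 match, -1 mismatch, -1 dash) and the name `local_alignment` are illustrative assumptions, not values taken from the slides.

```python
def local_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, piece_of_a, piece_of_b) for a best local alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score of an alignment ending at a[i-1], b[j-1];
    # 0 means "start a fresh alignment here".
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, row_a, row_b = local_alignment("GAACTGTGATTAGGACGT", "ATTTGGGACTACTGAGTA")
print(score)
print(row_a)
print(row_b)
```

Under these assumed scores the alignment shown on the slide is worth 4 (six matches, one mismatch, one dash), so the reported score is at least that.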
An Example
We can generalize to many genomes:
GAACTGTGATTATGCTCA
ATTTGGGACTACTGAGTA
ATCTTGAGATAGCTGAAA
Alignments:
ACTG-TGA
ACTACTGA
A-TGCTCA
10 / 43
Multiple Local Alignment
Issues:
Some subsequences can be present in some
genomes and absent in others
Genomes can have duplications
Multiple sequence alignment is NP-hard
→ We need some heuristics
11 / 43
Another Approach
Another way to find common subsequences is to
build a graph from the genomes
In such a graph homologous subsequences will
collapse into non-branching paths while unique ones
will form disjoint paths
12 / 43
The Linear Representation
Two genomes:
13 / 43
Solution: a Graph Representation
What we want to see:
14 / 43
Genomes as a Railroad
15 / 43
Why de Bruijn graph?
A simple object.
Demonstrated utility in:
Assembly
Read mapping
Synteny identification
16 / 43
The de Bruijn Graph
k = 2
Input strings: TGACGTC and TGACTTC
[Figure: one path per string, with 2-mers as vertices. First string: TG → GA → AC → CG → GT → TC. Second string: TG → GA → AC → CT → TT → TC.]
17 / 43
The de Bruijn Graph
[Figure: the two paths merged into a single de Bruijn graph with vertices TG, GA, AC, CG, GT, CT, TT, TC. The shared path TG → GA → AC branches at AC (to CG and to CT) and the branches rejoin at TC.]
18 / 43
The de Bruijn graph
In the de Bruijn graph identical substrings of length
at least k + 1 are collapsed into non-branching paths
We can use this to find homologous blocks.
We developed a tool, Sibelia, that finds such blocks
in many bacterial genomes and handles repeats.
But we can do more.
19 / 43
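A small sketch of how such a graph can be built from complete strings: every k-mer is a vertex and every (k+1)-mer is an edge. The function name `de_bruijn_graph` and the adjacency-set layout are assumptions for illustration, not Sibelia's data structure.

```python
from collections import defaultdict


def de_bruijn_graph(strings, k):
    """Map every k-mer (vertex) to the set of k-mers it has an edge to."""
    graph = defaultdict(set)
    for s in strings:
        if len(s) < k:
            continue  # too short to contain a single vertex
        # Every (k+1)-mer is an edge from its first k letters to its last k.
        for i in range(len(s) - k):
            edge = s[i:i + k + 1]
            graph[edge[:-1]].add(edge[1:])
        graph.setdefault(s[-k:], set())  # the last vertex may have no out-edge
    return dict(graph)


graph = de_bruijn_graph(["TGACGTC", "TGACTTC"], k=2)
for vertex in sorted(graph):
    print(vertex, "->", sorted(graph[vertex]))
```

Both strings contribute the edges TGA and GAC, but each is stored once, so the common substring TGAC shows up as one shared non-branching path.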
Alignment to a Graph
It is common to have an unassembled genome
Reads are then aligned to a very similar reference
genome:
20 / 43
Alignment to a Graph
Issues:
More than one reference?
Repeats within genomes?
Solution: align reads to a graph!
21 / 43
Alignment to a Graph
In the future, genome graphs will encode information
about a population
Aligning reads to a graph has many advantages:
Efficient alignment to many genomes
Reusing information about variants
Handling of repeats
The de Bruijn graph is a feasible model for a graph
reference.
Issue: the graph can be too large.
22 / 43
Compaction
After compaction:
[Figure: the compacted graph with vertices TG, AC, TC and edges labeled TGAC (TG → AC), ACGTC and ACTTC (both AC → TC).]
23 / 43
The Challenge
Construct the compacted graph from many large
genomes, bypassing traversal of the ordinary graph.
Earlier work based on suffix arrays/trees: Sibelia and
SplitMEM handled about 60 E. coli genomes.
A recent advance: 7 human genomes in 15 hours using 100
GB of RAM with a BWT-based algorithm (Baier
et al., 2015; Beller et al., 2014).
24 / 43
Junctions
A vertex v is a junction if:
v has ≥ 2 distinct outgoing or incoming edges
v is the first or the last k-mer of an input string
Facts:
Junctions = vertices of the compacted graph
Compaction = finding positions of junctions
25 / 43
Observations
[Figure: the compacted graph with vertices TG, AC, TC and edges TGAC, ACGTC, ACTTC.]
The string TGACGTC visits the vertices TG GA AC CG GT TC.
Keeping only the junctions leaves TG → AC → TC.
26 / 43
The Observation
The observation only works when we have complete
genomes.
Once we know junctions, construction of the edges is
simple.
We can simply traverse input strings and record
junctions in the order they appear.
How to identify junctions?
27 / 43
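Assuming the junction set is already known, the traversal described on this slide can be sketched as follows. The name `compacted_edges` and the choice to return edge labels as a set of strings are illustrative assumptions.

```python
def compacted_edges(strings, k, junctions):
    """Cut every input string at its junction k-mers.

    Each piece between two consecutive junctions (both included) is the
    label of one edge of the compacted graph.
    """
    edges = set()
    for s in strings:
        start = None  # position of the previous junction seen in s
        for i in range(len(s) - k + 1):
            if s[i:i + k] in junctions:
                if start is not None:
                    edges.add(s[start:i + k])
                start = i
    return edges


print(sorted(compacted_edges(["TGACGTC", "TGACTTC"], 2, {"TG", "AC", "TC"})))
```

On the two example strings with junctions TG, AC and TC this yields the three edge labels TGAC, ACGTC and ACTTC from the compaction slide.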
The Naive Algorithm
A naive way:
Store all (k + 1)-mers (edges) in a hash table
Consider each vertex one by one
Query all possible edges from the table
If more than one edge is found, mark the vertex as a junction
Problem: the hash table can be too large.
28 / 43
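The naive procedure can be sketched with a Python `set` standing in for the hash table. The function name `find_junctions_naive` and the fixed DNA alphabet are assumptions for illustration.

```python
def find_junctions_naive(strings, k, alphabet="ACGT"):
    """Find junction k-mers by querying an exact table of (k+1)-mers."""
    edges = set()      # the hash table: every (k+1)-mer of the input
    vertices = set()   # every k-mer of the input
    junctions = set()
    for s in strings:
        for i in range(len(s) - k):
            edges.add(s[i:i + k + 1])
        for i in range(len(s) - k + 1):
            vertices.add(s[i:i + k])
        if len(s) >= k:
            # The first and the last k-mer of a string are junctions by definition.
            junctions.add(s[:k])
            junctions.add(s[-k:])

    for v in vertices:
        # Query every possible outgoing and incoming edge of v.
        out_degree = sum((v + c) in edges for c in alphabet)
        in_degree = sum((c + v) in edges for c in alphabet)
        if out_degree > 1 or in_degree > 1:
            junctions.add(v)
    return junctions


print(sorted(find_junctions_naive(["TGACGTC", "TGACTTC"], k=2)))
```

The table holds every distinct (k+1)-mer of the input, which is the memory problem the slide points out.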
An Example
Hash table = { GA → AC }
[Figure: vertex GA and its four possible successors AA, AG, AC, AT; only the edge GA → AC is found in the table.]
29 / 43
What is a Bloom filter?
A probabilistic data structure representing a set
Properties:
Occupies fixed space
May generate false positives on queries
False positive rate is low
Example: Bloom Filter = { GA → AC }
Is GA → AC in the set? Yes.
Is GA → AT in the set? Probably no (a false positive is possible).
30 / 43
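A minimal Bloom filter sketch showing the properties listed above. The class name, the bit-array size, and the use of seeded SHA-256 digests as hash functions are assumptions chosen for clarity, not the filter used in the talk's software.

```python
import hashlib


class BloomFilter:
    """Fixed-size set sketch: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # space never grows

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded digests of the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True if every position is set: the item was added, or the bits
        # were set by other items (a false positive).
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


edges = BloomFilter(num_bits=1024, num_hashes=3)
edges.add("GAC")        # the edge GA -> AC written as a 3-mer
print("GAC" in edges)   # an added item is always reported as present
print("GAT" in edges)   # almost always False, but True is possible
```

A query that returns False is always correct; only a True answer can be wrong.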
An Example
Bloom Filter = { GA → AC, GA → AT }
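[Figure: vertex GA and its four possible successors AA, AG, AC, AT; the filter reports the edges GA → AC and GA → AT.]
The purple edge (GA → AT) is a false positive.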
31 / 43
The Two Pass Algorithm
How to eliminate false positives?
Two-pass algorithm:
1. Use the Bloom filter to identify junction
candidates
2. Use the hash table, but store only edges that
touch candidates
32 / 43
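The two passes can be sketched as below. This is an illustration under stated assumptions, not the TwoPaCo implementation: `OneHashFilter` is a hypothetical, deliberately lossy stand-in for a Bloom filter (one hash function, very few bits) so that false positives actually occur, and a Python `set` plays the role of the hash table.

```python
class OneHashFilter:
    """Hypothetical tiny Bloom-style filter with a single hash function."""

    def __init__(self, num_bits=8):
        self.bits = [False] * num_bits

    def _position(self, item):
        return sum(ord(c) for c in item) % len(self.bits)

    def add(self, item):
        self.bits[self._position(item)] = True

    def __contains__(self, item):
        return self.bits[self._position(item)]


def kmers(s, k):
    return (s[i:i + k] for i in range(len(s) - k + 1))


def branches(v, edge_set, alphabet):
    """True if v has more than one outgoing or incoming edge in edge_set."""
    out_degree = sum((v + c) in edge_set for c in alphabet)
    in_degree = sum((c + v) in edge_set for c in alphabet)
    return out_degree > 1 or in_degree > 1


def two_pass_junctions(strings, k, edge_filter, alphabet="ACGT"):
    # Pass 1: put all edges into the filter and collect junction candidates.
    for s in strings:
        for edge in kmers(s, k + 1):
            edge_filter.add(edge)
    candidates = {v for s in strings for v in kmers(s, k)
                  if branches(v, edge_filter, alphabet)}

    # Pass 2: exact table, restricted to edges that touch a candidate.
    table = set()
    for s in strings:
        for edge in kmers(s, k + 1):
            if edge[:-1] in candidates or edge[1:] in candidates:
                table.add(edge)
    junctions = {v for v in candidates if branches(v, table, alphabet)}

    # First and last k-mers of each string are junctions by definition.
    for s in strings:
        if len(s) >= k:
            junctions.update((s[:k], s[-k:]))
    return junctions


print(sorted(two_pass_junctions(["TGACGTC", "TGACTTC"], 2, OneHashFilter(8))))
```

Because the filter never misses a real edge, every true junction survives pass 1, and pass 2 recounts its edges exactly, so the result does not depend on how many false positives the filter produced.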
An Example: the First Step
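Here the edges are stored in the Bloom filter; the purple ones are
false positives:
[Figure: the graph as reported by the Bloom filter, with the real vertices TG, GA, AC, CG, GT, CT, TT, TC and the extra vertices CC and AT reached by false-positive edges.]
Junction candidates: GA and AC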
33 / 43
An Example: the Second Step
Edges stored in the hash table. We kept only edges
touching junction candidates:
Junction: AC
34 / 43
Results
Datasets:
7 humans: 5 versions of the reference +
2 haplotypes of NA12878 from 1000 Genomes
93 simulated humans (FIGG)
8 primates available in UCSC genome browser
35 / 43
Results
Running time in minutes, with memory usage in GB in parentheses.
# genomes        BWT-based       TwoPaCo         TwoPaCo
                 (1 thread)      (1 thread)      (15 threads)
Humans
7, k = 25        867 (100.30)    436 (4.40)      63 (4.84)
7, k = 100       807 (46.02)     317 (8.42)      57 (8.75)
43+7, k = 25     -               -               705 (69.77)
43+7, k = 100    -               -               927 (70.21)
93+7, k = 25     -               -               1383 (77.42)
Primates
8, k = 25        -               914 (34.36)     111 (34.36)
8, k = 100       -               756 (56.06)     101 (61.68)
36 / 43
Conclusion & Future Work
Advantages of the algorithm:
Fast
Small memory footprint
Can handle large inputs
Drawbacks:
Less applicable for large k
Take home message: it is easy to construct the
compacted de Bruijn graph for complete genomes.
37 / 43
Conclusion & Future Work
Can potentially facilitate:
Visualization
Synteny mining (Sibelia)
Structural variations analysis
...
38 / 43
Acknowledgments
Personal:
Daniel Lemire
GFA format working group
Funding, NSF awards:
DBI-1356529
CCF-1439057
IIS-1453527
IIS-1421908
39 / 43
Thank you for your attention!
Twitter: @IliaMinkin
40 / 43
Input Size vs. Performance
41 / 43
Parallel Scalability
42 / 43
Splitting
Table 1: The minimal number of rounds it takes to compress
the graph without exceeding a given memory threshold.
Memory threshold   Used memory   Bloom filter size   Running time   Rounds
10                 8.62          8.59                259            1
8                  6.73          4.29                434            3
6                  5.98          4.29                539            4
4                  3.51          2.14                665            6
43 / 43

More Related Content

Similar to Comparative Genomics and de Bruijn graphs

de Bruijn Graph Construction from Combination of Short and Long Reads
de Bruijn Graph Construction from Combination of Short and Long Readsde Bruijn Graph Construction from Combination of Short and Long Reads
de Bruijn Graph Construction from Combination of Short and Long Reads
Sikder Tahsin Al-Amin
 
20100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture0720100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture07
Computer Science Club
 
Second Order Heuristics in ACGP
Second Order Heuristics in ACGPSecond Order Heuristics in ACGP
Second Order Heuristics in ACGP
hauschildm
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
c.titus.brown
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
Genome Reference Consortium
 
LPEI_ZCNI_Poster
LPEI_ZCNI_PosterLPEI_ZCNI_Poster
LPEI_ZCNI_Poster
Long Pei
 
20080110 Genome exploration in A-T G-C space: an introduction to DNA walking
20080110 Genome exploration in A-T G-C space: an introduction to DNA walking20080110 Genome exploration in A-T G-C space: an introduction to DNA walking
20080110 Genome exploration in A-T G-C space: an introduction to DNA walking
Jonathan Blakes
 
Paired-end alignments in sequence graphs
Paired-end alignments in sequence graphsPaired-end alignments in sequence graphs
Paired-end alignments in sequence graphs
Chirag Jain
 
Presentation 2009 Journal Club Azhar Ali Shah
Presentation 2009 Journal Club Azhar Ali ShahPresentation 2009 Journal Club Azhar Ali Shah
Presentation 2009 Journal Club Azhar Ali Shah
guest5de83e
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformatics
Abhishek Vatsa
 
04 1 evolution
04 1 evolution04 1 evolution
04 1 evolution
Tianlu Wang
 
Ivd soda-2019
Ivd soda-2019Ivd soda-2019
Ivd soda-2019
AkankshaAgrawal55
 
Review And Evaluations Of Shortest Path Algorithms
Review And Evaluations Of Shortest Path AlgorithmsReview And Evaluations Of Shortest Path Algorithms
Review And Evaluations Of Shortest Path Algorithms
Pawan Kumar Tiwari
 
Review and evaluations of shortest path algorithms
Review and evaluations of shortest path algorithmsReview and evaluations of shortest path algorithms
Review and evaluations of shortest path algorithms
Pawan Kumar Tiwari
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013
Deanna Church
 
P0126557 slides
P0126557 slidesP0126557 slides
P0126557 slides
Nguyen Chien
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08
Computer Science Club
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
Nikolay Vyahhi
 
Inria Tech Talk - La classification de données complexes avec MASSICCC
Inria Tech Talk - La classification de données complexes avec MASSICCCInria Tech Talk - La classification de données complexes avec MASSICCC
Inria Tech Talk - La classification de données complexes avec MASSICCC
Stéphanie Roger
 
Integration of single molecule, genome mapping data in a web-based genome bro...
Integration of single molecule, genome mapping data in a web-based genome bro...Integration of single molecule, genome mapping data in a web-based genome bro...
Integration of single molecule, genome mapping data in a web-based genome bro...
William Chow
 

Similar to Comparative Genomics and de Bruijn graphs (20)

de Bruijn Graph Construction from Combination of Short and Long Reads
de Bruijn Graph Construction from Combination of Short and Long Readsde Bruijn Graph Construction from Combination of Short and Long Reads
de Bruijn Graph Construction from Combination of Short and Long Reads
 
20100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture0720100515 bioinformatics kapushesky_lecture07
20100515 bioinformatics kapushesky_lecture07
 
Second Order Heuristics in ACGP
Second Order Heuristics in ACGPSecond Order Heuristics in ACGP
Second Order Heuristics in ACGP
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
 
LPEI_ZCNI_Poster
LPEI_ZCNI_PosterLPEI_ZCNI_Poster
LPEI_ZCNI_Poster
 
20080110 Genome exploration in A-T G-C space: an introduction to DNA walking
20080110 Genome exploration in A-T G-C space: an introduction to DNA walking20080110 Genome exploration in A-T G-C space: an introduction to DNA walking
20080110 Genome exploration in A-T G-C space: an introduction to DNA walking
 
Paired-end alignments in sequence graphs
Paired-end alignments in sequence graphsPaired-end alignments in sequence graphs
Paired-end alignments in sequence graphs
 
Presentation 2009 Journal Club Azhar Ali Shah
Presentation 2009 Journal Club Azhar Ali ShahPresentation 2009 Journal Club Azhar Ali Shah
Presentation 2009 Journal Club Azhar Ali Shah
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformatics
 
04 1 evolution
04 1 evolution04 1 evolution
04 1 evolution
 
Ivd soda-2019
Ivd soda-2019Ivd soda-2019
Ivd soda-2019
 
Review And Evaluations Of Shortest Path Algorithms
Review And Evaluations Of Shortest Path AlgorithmsReview And Evaluations Of Shortest Path Algorithms
Review And Evaluations Of Shortest Path Algorithms
 
Review and evaluations of shortest path algorithms
Review and evaluations of shortest path algorithmsReview and evaluations of shortest path algorithms
Review and evaluations of shortest path algorithms
 
Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013Church_GenomeAccess_2013_genome2013
Church_GenomeAccess_2013_genome2013
 
P0126557 slides
P0126557 slidesP0126557 slides
P0126557 slides
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
Inria Tech Talk - La classification de données complexes avec MASSICCC
Inria Tech Talk - La classification de données complexes avec MASSICCCInria Tech Talk - La classification de données complexes avec MASSICCC
Inria Tech Talk - La classification de données complexes avec MASSICCC
 
Integration of single molecule, genome mapping data in a web-based genome bro...
Integration of single molecule, genome mapping data in a web-based genome bro...Integration of single molecule, genome mapping data in a web-based genome bro...
Integration of single molecule, genome mapping data in a web-based genome bro...
 

More from BioinformaticsInstitute

Graph genome
Graph genome Graph genome
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
BioinformaticsInstitute
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...BioinformaticsInstitute
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днк
BioinformaticsInstitute
 
Knime & bioinformatics
Knime & bioinformaticsKnime & bioinformatics
Knime & bioinformatics
BioinformaticsInstitute
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
BioinformaticsInstitute
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
BioinformaticsInstitute
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)
BioinformaticsInstitute
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
BioinformaticsInstitute
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
BioinformaticsInstitute
 
Biodb 2011-05
Biodb 2011-05Biodb 2011-05

More from BioinformaticsInstitute (20)

Graph genome
Graph genome Graph genome
Graph genome
 
A superglue for string comparison
A superglue for string comparisonA superglue for string comparison
A superglue for string comparison
 
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес... Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
Биоинформатический анализ данных полноэкзомного секвенирования: анализ качес...
 
Вперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днкВперед в прошлое. Методы генетической диагностики древней днк
Вперед в прошлое. Методы генетической диагностики древней днк
 
Knime & bioinformatics
Knime & bioinformaticsKnime & bioinformatics
Knime & bioinformatics
 
"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус"Зачем биологам суперкомпьютеры", Александр Предеус
"Зачем биологам суперкомпьютеры", Александр Предеус
 
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
Иммунотерапия раковых опухолей: взгляд со стороны системной биологии. Максим ...
 
Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)Рак 101 (Мария Шутова, ИоГЕН РАН)
Рак 101 (Мария Шутова, ИоГЕН РАН)
 
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
Секвенирование как инструмент исследования сложных фенотипов человека: от ген...
 
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
Инвестиции в биоинформатику и биотех (Андрей Афанасьев)
 
Biodb 2011-everything
Biodb 2011-everythingBiodb 2011-everything
Biodb 2011-everything
 
Biodb 2011-05
Biodb 2011-05Biodb 2011-05
Biodb 2011-05
 
Biodb 2011-04
Biodb 2011-04Biodb 2011-04
Biodb 2011-04
 
Biodb 2011-03
Biodb 2011-03Biodb 2011-03
Biodb 2011-03
 
Biodb 2011-01
Biodb 2011-01Biodb 2011-01
Biodb 2011-01
 
Biodb 2011-02
Biodb 2011-02Biodb 2011-02
Biodb 2011-02
 
Ngs 3 1
Ngs 3 1Ngs 3 1
Ngs 3 1
 
Ngs 1 0_0
Ngs 1 0_0Ngs 1 0_0
Ngs 1 0_0
 
Ngs 2 0_0
Ngs 2 0_0Ngs 2 0_0
Ngs 2 0_0
 
Ngs 7
Ngs 7Ngs 7
Ngs 7
 

Recently uploaded

3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
terusbelajar5
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
LengamoLAppostilic
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
Daniel Tubbenhauer
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Leonel Morgado
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
MaheshaNanjegowda
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
vluwdy49
 

Recently uploaded (20)

3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Comparative Genomics and de Bruijn graphs

  • 1. Comparative Genomics and the de Bruijn graphs Ilia Minkin Pennsylvania State University 16th September 2016 1 / 43
  • 3. What is comparative genomics? The collection of all research activities that derive biological insights by comparing genomic features [1]. Why do it? Learn evolution; learn function. [1] Comparative Genomics, Xuhua Xia.
  • 5. Learn Function: A genomic sequence itself does not show its functions. How to find function? Compare with sequences of known function; conserved sequences are likely to be important. How to compare genomes?
  • 6. What is an Alignment? Organisms inherit genomes, but with errors. [Figure: the ancestor genome and its descendants, Genome A and Genome B.] Which characters did A and B get from their ancestor?
  • 7. What is an Alignment? Alignments are written down as a table:
      ACTG-TGA
      ACTACTGA
    Blue letters are matches; yellow are mismatches; dashes are indels. This is a global alignment.
  • 8. The Global Alignment:
      ACTG-TGA
      ACTACTGA
    For two strings A and B: place them under each other; insert dashes into A and B so that |A| = |B|; penalize dashes and mismatches. Which alignment gives the least penalty? Complexity: O(|A||B|).
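The least-penalty question on the global alignment slide can be answered with the classic dynamic-programming table, which is where the O(|A||B|) bound comes from. The sketch below is an illustration rather than anything shown in the talk; the function name and the unit penalties for a mismatch and for a dash are assumptions.

```python
def global_alignment_penalty(a, b, mismatch=1, gap=1):
    """Least total penalty over all global alignments of a and b.

    The penalty values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dp[i][j] = least penalty for aligning the prefixes a[:i] and b[:j]
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap          # a[:i] aligned against dashes only
    for j in range(1, cols):
        dp[0][j] = j * gap          # b[:j] aligned against dashes only
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = min(substitution, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]


print(global_alignment_penalty("ACTGTGA", "ACTACTGA"))
```

With unit penalties, the two strings from the slide score 2, matching the alignment shown there (one mismatch and one dash).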
  • 10. The Local Alignment: For large sequences the global alignment does not work:
      GAACTGTGATTAGGACGT
      ATTTGGGACTACTGAGTA
    Apart from indels and mismatches, there can be rearrangements. Rearrangements change the order of whole blocks. Similar subsequences can be interleaved with something else.
  • 11. The Local Alignment:
      GAACTGTGATTAGGACGT
      ATTTGGGACTACTGAGTA
    A problem: for A and B, find their most similar subsequences and their alignment:
      ACTG-TGA
      ACTACTGA
    Complexity: O(|A||B|).
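One common way to find the most similar subsequences is a Smith-Waterman style table, in which a score is never allowed to drop below zero so that an alignment may start anywhere. The talk does not name a method or a scoring scheme, so the function name and the scores used here (match +1, mismatch -1, dash -1) are assumptions for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score of any local alignment between a and b.

    The scoring values are illustrative assumptions, not taken from the slides.
    """
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets the alignment restart at any position.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(local_alignment_score("GAACTGTGATTAGGACGT", "ATTTGGGACTACTGAGTA"))
```

Only two rows of the table are kept, so memory is linear in |B| while the running time stays O(|A||B|).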
  • 13. An Example: We can generalize to many genomes:
      GAACTGTGATTATGCTCA
      ATTTGGGACTACTGAGTA
      ATCTTGAGATAGCTGAAA
    Alignments:
      ACTG-TGA
      ACTACTGA
      A-TGCTCA
  • 15. Multiple Local Alignment. Issues: some subsequences can be present in some genomes and absent in others; genomes can have duplications; multiple sequence alignment is NP-hard → we need some heuristics.
  • 16. Another Approach: Another way to find common subsequences is to build a graph from the genomes. In such a graph, homologous subsequences collapse into non-branching paths, while unique ones form disjoint paths.
  • 17. The Linear Representation: Two genomes: [figure]
  • 18. Solution: a Graph Representation. What we want to see: [figure]
  • 19. Genomes as a Railroad [figure]
  • 20. Why de Bruijn graph? A simple object. Demonstrated utility in: assembly; read mapping; synteny identification.
  • 23. The de Bruijn Graph: k = 2, strings TGACGTC and TGACTTC. [Figure: each string drawn as a path of 2-mers, TG → GA → AC → CG → GT → TC and TG → GA → AC → CT → TT → TC.]
  • 25. The de Bruijn Graph: [Figure: the two paths above merged into one graph on the vertices TG, GA, AC, CG, GT, CT, TT, TC.]
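The construction in these slides can be sketched in a few lines: every k-mer is a vertex and every (k + 1)-mer of an input string is an edge from its first k characters to its last k characters. The function name and the dictionary-of-sets representation are assumptions made for this illustration.

```python
from collections import defaultdict


def de_bruijn_graph(strings, k):
    """Map each k-mer vertex to the set of k-mers it has an edge to."""
    successors = defaultdict(set)
    for s in strings:
        for i in range(len(s) - k):
            edge = s[i:i + k + 1]               # one (k + 1)-mer is one edge
            successors[edge[:k]].add(edge[1:])  # prefix k-mer -> suffix k-mer
    return dict(successors)


graph = de_bruijn_graph(["TGACGTC", "TGACTTC"], k=2)
for vertex in sorted(graph):
    print(vertex, "->", sorted(graph[vertex]))
```

For the two strings from the slides, the shared prefix TGAC yields a single path TG → GA → AC, and AC is the only vertex with two successors (CG and CT).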
  • 26. The de Bruijn graph: In the de Bruijn graph, identical substrings of length at least k + 1 are collapsed into non-branching paths. We can use this to find homologous blocks. We developed a tool, Sibelia, that finds such blocks in many bacterial genomes and handles repeats. But we can do more.
  • 27. Alignment to a Graph: It is common to have an unassembled genome. Reads are then aligned to a very similar reference genome: [figure]
  • 29. Alignment to a Graph. Issues: more than one reference? Repeats within genomes? Solution: align reads to a graph!
  • 33. Alignment to a Graph: In the future, genome graphs will encode information about a population. Aligning reads to a graph has many advantages: efficient alignment to many genomes; reuse of information about variants; handling of repeats. The de Bruijn graph is a feasible model for a graph reference. Issue: the graph can be too large.
  • 38. The Challenge: Construct the compacted graph from many large genomes, bypassing the ordinary graph traversal. Earlier work: based on suffix arrays/trees (Sibelia, SplitMEM); handled 60 E. coli genomes. A recent advance: 7 humans in 15 hours using 100 GB of RAM, with a BWT-based algorithm (Baier et al., 2015; Beller et al., 2014).
  • 41. Junctions: A vertex v is a junction if v has ≥ 2 distinct outgoing or incoming edges, or v is the first or the last k-mer of an input string. Facts: junctions = vertices of the compacted graph; compaction = finding positions of junctions.
  • 44. Observations: [Figure: the compacted graph with junctions TG, AC, TC and edges labeled TGAC, ACGTC, ACTTC; the path TG GA AC CG GT TC reduces to TG → AC → TC.]
  • 45. The Observation: The observation only works when we have complete genomes. Once we know the junctions, constructing the edges is simple: traverse the input strings and record junctions in the order they appear. How to identify junctions?
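The traversal described on this slide can be sketched as follows, assuming the set of junctions is already known. The function name is hypothetical, and returning each compacted edge as the substring spanning two consecutive junctions is a choice made for this illustration.

```python
def compacted_edges(strings, k, junctions):
    """Walk each input string and emit the substring between consecutive junctions."""
    edges = set()
    for s in strings:
        start = None                      # position of the previous junction, if any
        for i in range(len(s) - k + 1):
            if s[i:i + k] in junctions:
                if start is not None:
                    edges.add(s[start:i + k])
                start = i
    return edges


# Junctions of TGACGTC and TGACTTC for k = 2: AC and TC are branching
# vertices, and TG and TC are the first and last 2-mers of the inputs.
print(sorted(compacted_edges(["TGACGTC", "TGACTTC"], 2, {"TG", "AC", "TC"})))
```

For these inputs the result is the three edges TGAC, ACGTC and ACTTC, and no traversal of the ordinary graph is needed.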
  • 47. The Naive Algorithm. A naive way: store all (k + 1)-mers (edges) in a hash table; consider each vertex one by one; query all possible edges from the table; if more than 1 edge is found, mark the vertex as a junction. Problem: the hash table can be too large.
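A minimal sketch of the naive approach, assuming a DNA alphabet and treating a vertex as a junction when it has more than one outgoing or more than one incoming edge, as in the junction definition earlier in the talk. The function name is hypothetical. The first and last k-mers of each input, which are junctions by definition, are not added here.

```python
def naive_junctions(strings, k, alphabet="ACGT"):
    """Find branching vertices by querying an exact table of all (k + 1)-mers."""
    edges = set()      # plays the role of the hash table of (k + 1)-mers
    vertices = set()
    for s in strings:
        for i in range(len(s) - k):
            edges.add(s[i:i + k + 1])
        for i in range(len(s) - k + 1):
            vertices.add(s[i:i + k])
    junctions = set()
    for v in vertices:
        # Query every possible outgoing and incoming edge of v.
        out_degree = sum((v + c) in edges for c in alphabet)
        in_degree = sum((c + v) in edges for c in alphabet)
        if out_degree > 1 or in_degree > 1:
            junctions.add(v)
    return junctions


print(sorted(naive_junctions(["TGACGTC", "TGACTTC"], 2)))
```

The table holds every distinct (k + 1)-mer of the input, which is the memory cost the slide calls out.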
  • 48. An Example: Hash table = { GA → AC }. [Figure: vertex GA with the possible successors AA, AG, AC, AT.]
  • 51. What is the Bloom filter? A probabilistic data structure representing a set. Properties: occupies fixed space; may generate false positives on queries; the false positive rate is low. Example: Bloom filter = { GA → AC }. Is GA → AC in the set? Yes. Is GA → AT in the set? Most likely no, but a false positive is possible.
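The properties listed on this slide can be seen in a small sketch of a Bloom filter. The class name, the parameters `num_bits` and `num_hashes`, and the use of seeded SHA-256 digests as hash functions are all assumptions for illustration; the talk does not say how its filter is implemented.

```python
import hashlib


class BloomFilter:
    """Fixed-size approximate set: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # space is fixed up front

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True if all bits are set; they may have been set by other items.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


bloom = BloomFilter(num_bits=1024, num_hashes=3)
bloom.add("GAC")              # the edge GA -> AC, stored as a 3-mer
print("GAC" in bloom)         # always True for an item that was added
print("GAT" in bloom)         # usually False, but True is possible
```

An item that was added is always reported as present. An item that was never added is reported as present only if all of its bit positions happen to have been set by other items.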
  • 52. An Example: Bloom filter = { GA → AC, GA → AT }. [Figure: vertex GA with the possible successors AA, AG, AC, AT.] The purple edge is a false positive.
  • 54. The Two Pass Algorithm: How to eliminate false positives? Two-pass algorithm: 1. Use the Bloom filter to identify junction candidates. 2. Use the hash table, but store only edges that touch candidates.
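The two passes can be sketched as below. To keep the example deterministic, a hypothetical `LeakySet` class stands in for the Bloom filter: it never misses an added edge and answers yes for a chosen list of false positives. All names are assumptions, the toy input is not the graph drawn on the slides, and, as in the naive method, only branching vertices are detected.

```python
class LeakySet:
    """Hypothetical stand-in for a Bloom filter with hand-picked false positives."""

    def __init__(self, false_positives=()):
        self.items = set()
        self.false_positives = set(false_positives)

    def add(self, item):
        self.items.add(item)

    def __contains__(self, item):
        return item in self.items or item in self.false_positives


def kmers(s, n):
    return (s[i:i + n] for i in range(len(s) - n + 1))


def is_branching(v, edge_set, alphabet):
    out_degree = sum((v + c) in edge_set for c in alphabet)
    in_degree = sum((c + v) in edge_set for c in alphabet)
    return out_degree > 1 or in_degree > 1


def two_pass_junctions(strings, k, approx_set, alphabet="ACGT"):
    # Pass 1: load every edge into the approximate set and collect candidates.
    for s in strings:
        for edge in kmers(s, k + 1):
            approx_set.add(edge)
    candidates = {v for s in strings for v in kmers(s, k)
                  if is_branching(v, approx_set, alphabet)}
    # Pass 2: exact table holding only the edges that touch a candidate.
    exact = {edge for s in strings for edge in kmers(s, k + 1)
             if edge[:k] in candidates or edge[1:] in candidates}
    return {v for v in candidates if is_branching(v, exact, alphabet)}


leaky = LeakySet(false_positives={"GAT"})   # pretend GA -> AT is reported present
print(sorted(two_pass_junctions(["TGACGTC", "TGACTTC"], 2, leaky)))
```

In this run the false positive GA → AT makes GA a candidate in the first pass, and the exact table in the second pass rejects it. The exact table stores only edges adjacent to candidates, so it can be much smaller than the table of all edges.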
  • 55. An Example: the First Step. Here the edges are stored in the Bloom filter; the purple ones are false positives. [Figure: graph on the vertices AC, GT, CC, TT, CG, AT, GA, TG, TC, CT.] Junction candidates: GA, AC.
  • 56. An Example: the Second Step. Edges are stored in the hash table; we kept only the edges touching junction candidates. [figure] Junction: AC.
  • 57. Results. Datasets: 7 humans (5 versions of the reference + 2 haplotypes of NA12878 from 1000 Genomes); 93 simulated humans (FIGG); 8 primates available in the UCSC genome browser.
  • 58. Results: running time in minutes, with memory usage in GB in parentheses.
      Genomes           BWT-based, 1 thread   TwoPaCo, 1 thread   TwoPaCo, 15 threads
      Humans
      7, k = 25         867 (100.30)          436 (4.40)          63 (4.84)
      7, k = 100        807 (46.02)           317 (8.42)          57 (8.75)
      43+7, k = 25      -                     -                   705 (69.77)
      43+7, k = 100     -                     -                   927 (70.21)
      93+7, k = 25      -                     -                   1383 (77.42)
      Primates
      8, k = 25         -                     914 (34.36)         111 (34.36)
      8, k = 100        -                     756 (56.06)         101 (61.68)
  • 60. Conclusion & Future Work. Advantages of the algorithm: fast; small memory footprint; can handle large inputs. Drawbacks: less applicable for large k. Take-home message: it is easy to construct the compacted de Bruijn graph for complete genomes.
  • 61. Conclusion & Future Work. Can potentially facilitate: visualization; synteny mining (Sibelia); structural variation analysis; ...
  • 62. Acknowledgments. Personal: Daniel Lemire; GFA format working group. Funding, NSF awards: DBI-1356529, CCF-1439057, IIS-1453527, IIS-1421908.
  • 63. Thank you for your attention! Twitter: @IliaMinkin
  • 64. Input Size vs. Performance [figure]
  • 66. Splitting. Table 1: The minimal number of rounds it takes to compress the graph without exceeding a given memory threshold.
      Memory threshold   Used memory   Bloom filter size   Running time   Rounds
      10                 8.62          8.59                259            1
      8                  6.73          4.29                434            3
      6                  5.98          4.29                539            4
      4                  3.51          2.14                665            6