Taras Oleksyk at #ICG12: Innovative assembly strategy contributes to the understanding of evolution and conservation genetics of the critically endangered Solenodon paradoxus from the island of Hispaniola

Novel assembly approach for
the homozygous genomes
and the conservation
of critically endangered
Solenodon paradoxus
Taras K Oleksyk et al.

The Caribbean islands
of Puerto Rico
and Hispaniola

Solenodon paradoxus
• One of only two critically endangered
solenodon species, found on the largest Caribbean
islands: Cuba and Hispaniola
• One of the most ancient branches among the
placental mammals: divergence dates to the
Cretaceous, ~76 MYA, before the extinction of the
dinosaurs (Roca et al., 2004)

Questions
• What can millions of years of isolation do to a genome?
• Our earlier analysis (Brandt et al., 2016) supported the
speciation date of ~76 MYA originally proposed by
Roca et al. (2004)
• but this was contested by a recent analysis of five nuclear genes,
which placed the split at <60 MYA and also suggested over-water dispersal
• Morphometric and mtDNA studies of Hispaniolan solenodon
suggest that southern and northern populations may
represent distinct subspecies that split 171 KYA
• if confirmed, conservation units need to be defined and
the variation in each described

Expedition members
• Juan Carlos Martinez-Cruzado – UPRM
• Yashira Afanador - UPRM
• Liz A. Paulino – INTEQ
• Adrell Núñez – ZooDom
• Nicolas and Yimel De J. Corona

Sequencing results
The genome size was estimated at 2.06 Gbp using KmerGenie.
to maximize information derived from data?
Region | Province | Site | Coordinates | Sex | Weight (g) | Loc | Sample
North | Puerto Plata | Puerto Plata | Unknown | F | 886 | Zoo |
North | Espaillat | Cordillera Septentrional | Unknown | - | - | Zoo |
North | El Seybo | El Seybo | Unknown | M | 932 | Zoo |
North | Higuey | La Altagracia | Unknown | M | 758 | Zoo |
South | Pedernales | La Cañada del Verraco | N 18° 09’ 9.64” W 71° 43’ 12.0” | M | 579 | Wild | K
South | Pedernales | La Cañada del Verraco | N 18° 09’ 9.64” W 71° 43’ 12.0” | M | 1020 | Wild | L
South | Pedernales | El Manguito-1 | N 18° 06’ 36.6” W 71° 43’ 3.58” | M | 1270 | Wild | M
South | Pedernales | El Manguito-1 | N 18° 06’ 36.6” W 71° 43’ 3.58” | F | 1420 | Wild | N
South | Pedernales | El Manguito-2 | N 18° 07’ 6.5” W 71° 43’ 14.7” | F | 1120 | Wild | O
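
As a rough illustration of how a k-mer based genome size estimate works, the sketch below counts the distinct k-mers that occur at least twice in the reads and treats that count as the genome size. This is a deliberately simplified stand-in, not KmerGenie's actual method (KmerGenie fits a statistical model to k-mer abundance histograms). The function names, the toy sequence, and the `min_abundance` cutoff are assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(reads, k, min_abundance=2):
    """Approximate genome size as the number of distinct k-mers seen at
    least `min_abundance` times. Rare k-mers are assumed to be sequencing
    errors (an assumption of this sketch, not a KmerGenie parameter)."""
    counts = count_kmers(reads, k)
    return sum(1 for c in counts.values() if c >= min_abundance)


# Toy data (hypothetical): tile a short "genome" with overlapping reads,
# sequence it twice, and add one read carrying a single-base error.
genome = "ACGTTGCAAGGCTTAACCGGATCGTACGATCCA"
read_len = 12
tiling = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
reads = tiling * 2 + ["ACGTTGCAAGGA"]  # last base C -> A

print(estimate_genome_size(reads, k=5))
```

The k-mer created by the erroneous base appears only once and is filtered out, so the estimate equals the number of distinct k-mers in the toy genome.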

Choices for the assembly approach
given the data
b | de Bruijn assembly. Reads are decomposed into
overlapping k-mers. Contigs are formed by merging
chains of k-mers until repeat boundaries are reached.
If a k-mer appears multiple times, all duplicates are
discarded.
c | String graph assembly. Align all the reads.
Alignments that can be transitively inferred from all
pairwise alignments are removed. A graph is created
with a vertex for the endpoint of every read.
As a string/unitig graph encodes every valid assembly
of reads, such a graph, if correct, is in fact a lossless
representation of reads.
When there is allelic variation, alternative paths in the
graph are formed.
Genetic variation and the de novo assembly of human genomes
Chaisson, Wilson, & Eichler. Nature Reviews Genetics 16, 627–640 (2015)
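
The de Bruijn description above can be sketched in a few lines. This is a minimal toy, assuming error-free reads and a linear genome; the function names and the example sequence are hypothetical, and real assemblers add error correction, tip and bubble removal, and handling of isolated cycles, all omitted here.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer contributes one edge.
    Using a set collapses duplicate k-mers into a single edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(set)
    indegree = defaultdict(int)
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        graph[left].add(right)
        indegree[right] += 1
    return graph, indegree


def contigs_from_graph(graph, indegree):
    """Merge chains of nodes until a branch (repeat boundary) is reached.
    Isolated cycles are not handled in this sketch."""
    def is_simple(node):
        return indegree.get(node, 0) == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(set(graph) | set(indegree)):
        if is_simple(node):
            continue  # interior of a chain; reached from its start node
        for nxt in sorted(graph.get(node, ())):
            contig = node + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(graph[nxt]))
                contig += nxt[-1]
            contigs.append(contig)
    return contigs


# Toy genome (hypothetical) in which "TCCGA" occurs twice.
genome = "AAGTCCGATTCCGAGG"
reads = [genome[i:i + 8] for i in range(len(genome) - 8 + 1)]
graph, indegree = build_de_bruijn(reads, k=4)
contigs = contigs_from_graph(graph, indegree)
print(contigs)
```

With k = 4 the repeated segment collapses into one contig and the assembly breaks at its boundaries, which is the fragmentation at repeats that the caption describes.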

Comparative Assembly Results
Assembly | A | B | C | D
Contig assembly tool | Fermi | Fermi | SOAPdenovo2 | SOAPdenovo2
Scaffolding tool | SOAPdenovo2 | SSPACE | SOAPdenovo2 | SSPACE
Gap closing tool | GapCloser | GapCloser | GapCloser | GapCloser

Contig metrics | A, B (Fermi) | C, D (SOAPdenovo2)
Total contigs (>1,000 bp) | 71,429 | 189,566
Contig N50 (bp) | 54,944 | 4,048
Contig CEGMA (%) * | 96.37 (77.42) | 68.15 (33.06)
Contig BUSCO (%) | 86 (65) | 42 (21)

Li. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics 28(14):1838-44 (2012)
Luo, Liu, Xie, Li, Huang, Yuan, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1:18 (2012)

Assembly Metrics
Contig metrics | A, B | C, D
Total contigs (>1,000 bp) | 71,429 | 189,566
Contig N50 (bp) | 54,944 | 4,048
Contig CEGMA (%) * | 96.37 (77.42) | 68.15 (33.06)
Contig BUSCO (%) | 86 (65) | 42 (21)

Gap closing tool: GapCloser
Scaffold metrics | A | B | C | D
Total scaffolds (>1,000 bp) | 14,417 | 40,372 | 20,466 | -
Final N50 (bp) | 555,585 | 110,915 | 331,639 | -
Final CEGMA (%) | 95.56 (81.85) | 95.97 (88.71) | 95.97 (90.73) | -
Final BUSCO (%) | 91 (74) | 86 (64) | 94 (80) | -

Distribution of gene prediction support (panels A and B)
Proteins of four reference species (Sorex araneus, Erinaceus europaeus,
Homo sapiens and Mus musculus) were aligned to each S. paradoxus assembly
with Exonerate, with a maximum of three hits per protein.
Coding sequences (CDS) were cut, clustered and passed to AUGUSTUS as hints.
Proteins from the predicted genes were aligned by HMMER and BLAST to the
Pfam and Swiss-Prot databases. Only the genes supported by both hits to
protein databases and hints were retained.
Significantly more transcripts have higher hint support in assembly B.

Assembly Metrics
Contig metrics | A, B | C, D
Total contigs (>1,000 bp) | 71,429 | 189,566
Contig N50 (bp) | 54,944 | 4,048

Scaffold metrics | A | B | C | D
Total scaffolds (>1,000 bp) | 14,417 | 40,372 | 20,466 | -
Scaffold N50 (bp) | 555,585 | 110,915 | 331,639 | -
REAPR error-free bases (%) | 96.46 | 95.35 | 94.98 | -
REAPR low-scoring regions | 18 | 16 | 71 | -
REAPR incorrectly oriented reads | 11,543 | 5,329 | 28,964 | -

Comparing assemblies
Approach | Issue | Assembly A | Assembly B | Assembly C
REAPR | Low-scoring regions | 18 | 16 | 71
REAPR | Incorrectly oriented reads | 11,543 | 5,329 | 28,964
Progressive Cactus | Inversions | 87 | 34 | 81
Progressive Cactus | Translocations | 5 | 0 | 2
Applying “Occam’s Razor”

Assembly B seems to be the best assembly
• ~3x fewer contigs >1,000 bp and ~14x larger
contig N50 than the SOAPdenovo2-based assemblies
• Scaffolds are shorter, but contain
fewer low-scoring regions and
incorrectly oriented reads
• Has fewer inversions and
translocations when compared to
another genome
• Contains more transcripts with
higher hint support
• More support available
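
The ratios and the "fewest issues" claim can be checked directly against the numbers on the earlier slides. The short script below is only an arithmetic check; the dictionary layout and variable names are assumptions, and the mapping of Fermi contigs to assemblies A and B and SOAPdenovo2 contigs to C and D is inferred from the comparative table.

```python
# Contig-level figures from the assembly metrics slides.
contigs = {
    "Fermi": {"count": 71_429, "n50": 54_944},         # assemblies A and B
    "SOAPdenovo2": {"count": 189_566, "n50": 4_048},   # assemblies C and D
}

count_ratio = contigs["SOAPdenovo2"]["count"] / contigs["Fermi"]["count"]
n50_ratio = contigs["Fermi"]["n50"] / contigs["SOAPdenovo2"]["n50"]

# REAPR and Progressive Cactus figures from the "Comparing assemblies" slide.
issues = {
    "A": {"low_scoring_regions": 18, "misoriented_reads": 11_543,
          "inversions": 87, "translocations": 5},
    "B": {"low_scoring_regions": 16, "misoriented_reads": 5_329,
          "inversions": 34, "translocations": 0},
    "C": {"low_scoring_regions": 71, "misoriented_reads": 28_964,
          "inversions": 81, "translocations": 2},
}

# For each issue type, find the assembly with the lowest count.
best_per_issue = {
    issue: min(issues, key=lambda name: issues[name][issue])
    for issue in issues["A"]
}

print(f"{count_ratio:.2f}x fewer contigs, {n50_ratio:.2f}x larger contig N50")
print(best_per_issue)
```

The ratios come out at roughly 2.65 and 13.57, which round to the 3x and 14x quoted on the slide, and assembly B has the lowest count in all four issue categories.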

The inferred divergence time of S. paradoxus
from other mammals, 73.6 Mya (95% confidence
interval: 61.4-88.2 Mya), is confirmed
• Divergence time estimates are based on
four-fold degenerate sites and
on fossil-based priors
• The 95% confidence intervals are given in
square brackets and depicted as
semitransparent boxes around the nodes
• Confirms the estimates of Roca et al. (2004)
and Brandt et al. (2016)
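
Four-fold degenerate sites are third codon positions where any of the four bases encodes the same amino acid, so substitutions there are synonymous. The sketch below shows one way to locate them in a coding sequence using the standard genetic code; it only illustrates the concept and is not the pipeline used in the study. The function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold(codon):
    """True if all four third-position variants encode the same amino acid."""
    prefix = codon[:2]
    return len({CODON_TABLE[prefix + base] for base in BASES}) == 1


def fourfold_sites(cds):
    """Return 0-based positions of four-fold degenerate third-codon bases.
    Codons containing ambiguous bases (e.g. N) are skipped."""
    cds = cds.upper()
    sites = []
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if codon in CODON_TABLE and is_fourfold(codon):
            sites.append(start + 2)
    return sites


# Hypothetical example: Met-Ala-Leu-Lys-stop.
print(fourfold_sites("ATGGCTCTGAAATAA"))  # [5, 8]
```

In the example only the GCT (Ala) and CTG (Leu) codons have four-fold degenerate third positions; ATG, AAA and the stop codon do not.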

Homozygosity &
demographic history
• Solenodon is among the most
homozygous mammals known, with
variation at least as low as that of the Amur tiger
• the real number is probably lower, since this
estimate is based on the combined genome of
five individuals
• Patterns of SNP variation allowed us to
infer population demography, which
indicated that northern and southern
subspecies split at least 300 Kya.
• Also: genome annotation (genes, repeats),
signatures of selection, evolution of venom genes
• Developed population markers (microsatellites) for
conservation studies

Assembly B makes
the next assembly possible
Metric | Short Read Input Assembly | Dovetail HiRise Assembly
Total Length | 2,049.42 Mb | 2,053.16 Mb
L50/N50 | 5,328 scaffolds; 0.111 Mb | 16 scaffolds; 42.790 Mb
L90/N90 | 19,167 scaffolds; 0.028 Mb | 51 scaffolds; 7.507 Mb
Genome size: 2.06 Gbp
Estimated physical coverage (1-100 kb pairs): 116.57X
Collaboration: Harris Lewin
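
For readers unfamiliar with the L50/N50 and L90/N90 notation in the table: Nx is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover x% of the assembly, and Lx is the number of scaffolds in that set. A minimal sketch of the calculation follows; the function name and the toy lengths are hypothetical.

```python
def nx_lx(lengths, percent=50):
    """Return (Nx, Lx) for a list of scaffold lengths.

    Scaffolds are sorted from longest to shortest and accumulated until
    they cover at least `percent` of the total assembly length.
    Integer arithmetic avoids floating-point rounding at the threshold.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 100 >= percent * total:
            return length, count
    raise ValueError("lengths must be a non-empty list of positive values")


# Toy scaffold lengths (hypothetical), total 300.
scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(nx_lx(scaffolds, 50))  # (70, 2)
print(nx_lx(scaffolds, 90))  # (30, 5)
```

A higher N50 together with a lower L50 means the same fraction of the genome sits in fewer, longer scaffolds, which is the change the table shows between the short-read input and the HiRise assembly.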

Why stop here?
• Putting the reference quality genome
• Comparative genomics – Cuban solenodon genome
• Understanding island genome evolution
• Population and conservation genomics

Thank you
• Sergey Kliver
• Pavel Dobrynin
• Aleksey Komissarov
• Ksenia Krasheninnikova
• Stephen J. O’Brien
• Kirill Grigorev
• Yashira M. Afanador
• Walter Wolfsberger
• Audrey J. Majeske
• Juan Carlos Martinez-Cruzado
• Liz A. Paulino
• Rosanna Carreras
• Luis E. Rodríguez
• Adrell Núñez
• David Hernández-Martich
• Filipe Silva
• Agostinho Antunes
NSF project #1432092
• Alfred L. Roca
• Adam Brandt
