Dr. Douglas G. Scofield is Principal Research Engineer at Umeå Plant Science Centre at Umeå University in Sweden. Slides for his presentation: Assembling the Norway Spruce genome.
Assembling the Norway Spruce Genome: 20Gb and many challenges, Umeå Plant Sciences Centre, Umeå University, Douglas G. Scofield Copenhagenomics 2012
1. Assembling the Norway Spruce genome: 20 Gb and many challenges
Douglas G. Scofield
Umeå Plant Science Centre, Umeå University
2. Picea abies: A Very Large Genome
• 20 Gb, nearly 7x human
• typical for conifers
• Little known about gene and genome structure
• Only ~0.1 % consists of genes
• Largest genome sequenced so far
• Other ongoing conifer sequencing projects
• Picea glauca (Canada)
• Pinus taeda (US)
Genome sizes for comparison:
• Arabidopsis: 120 Mbp
• Poplar: 450 Mbp
• Humans: 3 000 Mbp
• Norway spruce: 20 000 Mbp
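The size comparison on this slide is simple arithmetic, and a short sketch makes the "nearly 7x human" and "~0.1 % genes" figures concrete. The function names are illustrative only, and the genic total is just the slide's rounded fraction multiplied out, not a measured value.

```python
# Genome sizes in Mbp, as listed on the slide.
GENOME_SIZES_MBP = {
    "Arabidopsis": 120,
    "Poplar": 450,
    "Humans": 3_000,
    "Norway spruce": 20_000,
}


def fold_relative_to(species, reference="Humans"):
    """Return how many times larger `species` is than `reference`."""
    return GENOME_SIZES_MBP[species] / GENOME_SIZES_MBP[reference]


def genic_mbp(genome_mbp, genic_fraction=0.001):
    """Sequence made up of genes, given the slide's ~0.1 % figure."""
    return genome_mbp * genic_fraction


if __name__ == "__main__":
    for name in GENOME_SIZES_MBP:
        print(f"Norway spruce is {fold_relative_to('Norway spruce', name):.1f}x {name}")
    print(f"Approximate genic sequence: {genic_mbp(20_000):.0f} Mbp")
```

Under these numbers the spruce genome is about 6.7 times the human genome, and the genic portion comes to roughly 20 Mbp.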
3. Sequencing overview
Tissue: 2n needle tissue and 1n megagametophytes (three megagametophytes from the reference tree)
Illumina paired-end
• 150-bp insert
• 300-bp insert
• 650-bp insert
454 single-end (1.5x)
Mate-pair libraries
• 2kbp insert
• 4kbp insert
• 10kbp insert
• Fosmid ends (40kbp)
454 and Illumina transcriptomes
~30-35x input coverage for each
~50x input coverage
plus, fosmid pools…
4. Fosmid pools: the plan
Fosmid library
~40kbp / fosmid
Pool
~1000 fosmids / pool = ~40 Mbp @ 1n, much simpler assembly
8 pools / HiSeq lane = 60x coverage with 300-bp PE reads
500 pools = 1x genome coverage
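The pooling plan is a chain of multiplications, sketched below. The pool size and the 500-pool figure follow directly from the slide. The lane yield is only what the "8 pools at 60x" line implies, not a number the slides state. The coverage-fraction function uses the idealized Lander-Waterman expectation for randomly placed clones, which is an added assumption.

```python
import math

FOSMID_KBP = 40          # ~40 kbp per fosmid (slide)
FOSMIDS_PER_POOL = 1000  # ~1000 fosmids per pool (slide)
GENOME_GBP = 20          # Norway spruce genome size (slide)
POOLS_PER_LANE = 8       # pools per HiSeq lane (slide)
POOL_COVERAGE = 60       # target read coverage per pool (slide)


def pool_size_mbp(fosmids=FOSMIDS_PER_POOL, fosmid_kbp=FOSMID_KBP):
    """Haploid sequence contained in one pool, in Mbp."""
    return fosmids * fosmid_kbp / 1000


def pools_for_genome_coverage(coverage=1.0, genome_gbp=GENOME_GBP):
    """Number of pools whose summed size equals `coverage` x the genome."""
    return coverage * genome_gbp * 1000 / pool_size_mbp()


def implied_lane_yield_gbp(pools=POOLS_PER_LANE, coverage=POOL_COVERAGE):
    """Read bases per lane implied by the plan (derived, not stated)."""
    return pools * coverage * pool_size_mbp() / 1000


def expected_fraction_covered(clone_coverage):
    """Idealized fraction of the genome hit by at least one clone,
    assuming clones are placed uniformly at random (Lander-Waterman)."""
    return 1 - math.exp(-clone_coverage)


if __name__ == "__main__":
    print(pool_size_mbp(), "Mbp per pool")
    print(pools_for_genome_coverage(), "pools for 1x")
    print(implied_lane_yield_gbp(), "Gbp per lane (implied)")
    print(round(expected_fraction_covered(1.0), 2), "of genome covered at 1x")
```

Even at a nominal 1x clone coverage, random sampling would be expected to leave roughly a third of the genome untouched, so fosmid pools complement rather than replace the whole-genome shotgun data.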
5. Fosmid pools: the reality
40% reads lost to E. coli, and 20-50% mitochondrial
~0.5x realized genome coverage
BUT … low redundancy among pools, and assembly is easier
Figure: "% recovered" and "WGS assembly, Mbp" plotted against "Cumulated assembly of fosmid pools, Mbp" (0 to 700), with series for >5kbp, >10kbp and >20kbp.
6. Putting it all together
Inputs:
• WGS (1n): CLC
• Fosmid pool: CLC + BESST
Merged with GAM: Genomic Assemblies Merger (Casagrande et al.)
Scaffolded with BESST: fast, lightweight scaffolder developed at SciLifeLab
Final: multiple rounds of GAM, scaffolding with transcripts, …
7. Assembly quality
• >80% of genome in scaffolds and contigs
• 14% total genome length in scaffolds >10Kb (vs. 1%)
• N50 8800 bp (vs. 2900 bp)
• Longest contig 1.15 Mb (vs. 206 Kb)
Feature Response Curves: quantifying quality using configurations of mapped paired-end reads
Figure axis: Accumulated length of contigs
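For readers unfamiliar with the contiguity statistics quoted on this slide, the sketch below shows the usual definition of N50 and a helper for the "share of genome in sequences above a length cutoff" style of figure. It is a generic illustration with made-up lengths, not the team's validation code.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least
    half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def fraction_in_long_sequences(lengths, min_length, genome_size):
    """Share of `genome_size` held in sequences longer than `min_length`."""
    return sum(x for x in lengths if x > min_length) / genome_size


if __name__ == "__main__":
    toy = [2, 2, 2, 3, 3, 4, 8, 8]  # made-up lengths for illustration
    print("N50:", n50(toy))
    print("Fraction in sequences > 3:", fraction_in_long_sequences(toy, 3, 40))
```

N50 rewards a few long sequences, so it is best read alongside the validation curves rather than on its own.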
8. Now for a little spruce biology…
Genetic diversity
Low gene content
Allele splitting
Repeat content
9. Genetic diversity
Zygosity correlation
ε = 2.46 × 10^-3 errors/read-base
– sequencing error + misaligned reads
Θ = 9.47 × 10^-3 ± 0.007 × 10^-3
– scaled mutation rate, Θ = 4Neμ
Ne ≈ 1.0 – 1.7 × 10^6
– using LTR-based estimates of μ
Figure: Δ, zygosity correlation (0.000 to 0.008) against Distance between sites (bp) (50 to 700)
Heterozygous sites
Ho ≈ 0.27%
– 1.42 Gbp examined
Figure: Count (Mb) by IUPAC Code (Genotype): W (AT), S (CG), K (GT), M (AC), R (AG), Y (CT)
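The relation Θ = 4Neμ on this slide can be rearranged in either direction. The slides give Θ and the resulting Ne range but not the μ values used, so the sketch below only back-calculates the per-site mutation rates that the quoted range implies. Those implied rates are derived here and are not figures from the talk.

```python
THETA = 9.47e-3                  # scaled mutation rate (slide)
NE_RANGE = (1.0e6, 1.7e6)        # effective population size range (slide)
HO = 0.0027                      # observed heterozygosity, 0.27 % (slide)
EXAMINED_BP = 1.42e9             # 1.42 Gbp examined (slide)


def effective_population_size(theta, mu):
    """Ne from theta = 4 * Ne * mu."""
    return theta / (4 * mu)


def implied_mutation_rate(theta, ne):
    """mu from theta = 4 * Ne * mu (back-calculated, not from the slides)."""
    return theta / (4 * ne)


def heterozygous_sites(ho, examined_bp):
    """Expected number of heterozygous sites among the examined bases."""
    return ho * examined_bp


if __name__ == "__main__":
    for ne in NE_RANGE:
        print(f"Ne = {ne:.1e} implies mu = {implied_mutation_rate(THETA, ne):.2e}")
    print(f"Heterozygous sites: {heterozygous_sites(HO, EXAMINED_BP):.3g}")
```

With these inputs the implied μ falls between roughly 1.4 × 10^-9 and 2.4 × 10^-9 per site, and 0.27% of 1.42 Gbp is about 3.8 million heterozygous sites.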
21. Repeat content in contig ends vs. middles
20-mers appearing in 100-bp segments from 100K contigs
Gepard: http://www.helmholtz-muenchen.de/en/mips/services/analysis-tools/gepard/index.html
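One way to compare repeat content in contig ends and middles is to count how often the 20-mers in 100-bp end segments recur, compared with those in middle segments. The slides do not spell out the exact metric, so the scoring below (the share of k-mer occurrences whose k-mer appears more than once) and all function names are assumptions for illustration.

```python
from collections import Counter


def kmers(seq, k=20):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def end_and_middle_segments(contig, seg_len=100):
    """Return ([left end, right end], middle), or None if the contig
    is too short to give three non-overlapping segments."""
    if len(contig) < 3 * seg_len:
        return None
    mid_start = (len(contig) - seg_len) // 2
    ends = [contig[:seg_len], contig[-seg_len:]]
    return ends, contig[mid_start:mid_start + seg_len]


def repeated_kmer_fraction(segments, k=20):
    """Share of k-mer occurrences whose k-mer is seen more than once."""
    counts = Counter(km for seg in segments for km in kmers(seg, k))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for c in counts.values() if c > 1) / total


def compare_ends_and_middles(contigs, k=20, seg_len=100):
    """Return (repeat fraction in ends, repeat fraction in middles)."""
    ends, middles = [], []
    for contig in contigs:
        segments = end_and_middle_segments(contig, seg_len)
        if segments is not None:
            ends.extend(segments[0])
            middles.append(segments[1])
    return repeated_kmer_fraction(ends, k), repeated_kmer_fraction(middles, k)


if __name__ == "__main__":
    # Toy contig with repetitive ends and a unique middle (small k for demo).
    print(compare_ends_and_middles(["AAAACGTGAAAA"], k=3, seg_len=4))
```

If contigs tend to break inside repeats, the first value should come out higher than the second on real data.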
22. The repeat landscape in Picea abies (so far)
Transposable elements!
Callout: 3 human genomes of LTRs!
Repeat class      %
LINE              0.44
LTR copia        13.69
LTR gypsy        29.61
LTR uncl.        15.22
DNA_TEs           0.57
Unclassified      9.24
Total            68.77
Figure panel: Ty1-copia RT
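The repeat figures can be checked and converted to absolute amounts with a few lines. The slide does not say whether the percentages refer to the full 20 Gb genome or only to the assembled portion, so the base size is left as a parameter and both readings are shown. The 3 Gbp human genome size is the rounded value from the size comparison earlier in the deck.

```python
# Repeat classes and their percentages, as listed on the slide.
REPEAT_PERCENT = {
    "LINE": 0.44,
    "LTR copia": 13.69,
    "LTR gypsy": 29.61,
    "LTR uncl.": 15.22,
    "DNA_TEs": 0.57,
    "Unclassified": 9.24,
}
HUMAN_GENOME_GBP = 3.0


def total_percent():
    """Sum of all repeat classes; should match the slide's total."""
    return sum(REPEAT_PERCENT.values())


def ltr_percent():
    """Combined share of the three LTR classes."""
    return sum(v for name, v in REPEAT_PERCENT.items() if name.startswith("LTR"))


def ltr_in_human_genomes(base_gbp):
    """LTR sequence expressed in human-genome equivalents, for an
    assumed base size (whole genome or assembled portion)."""
    return ltr_percent() / 100 * base_gbp / HUMAN_GENOME_GBP


if __name__ == "__main__":
    print(f"Total: {total_percent():.2f} %")
    print(f"LTRs:  {ltr_percent():.2f} %")
    print(f"Of 20 Gbp: {ltr_in_human_genomes(20):.1f} human genomes")
    print(f"Of 16 Gbp: {ltr_in_human_genomes(16):.1f} human genomes")
```

The classes sum to the stated 68.77, and the LTR share alone works out to between three and four human genomes' worth of sequence depending on the base assumed.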
23. Age of LTR insertions
Figure: Time back to LTR insertion (Mya), with 4 Mya and 85 Mya marked
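The slides do not describe how insertion ages were computed. A widely used approach, sketched here as an assumption rather than as the team's method, dates an LTR retrotransposon from the divergence between its two LTRs, which are identical at insertion: T = K / (2r), with K corrected for multiple hits using Jukes-Cantor. The substitution rate r is a caller-supplied parameter, and the value in the demo is purely illustrative.

```python
import math


def p_distance(ltr_a, ltr_b):
    """Proportion of differing sites between two aligned, gap-free LTRs."""
    if len(ltr_a) != len(ltr_b) or not ltr_a:
        raise ValueError("LTRs must be aligned, non-empty and of equal length")
    return sum(a != b for a, b in zip(ltr_a, ltr_b)) / len(ltr_a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance K for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("p too large for the Jukes-Cantor correction")
    if p == 0:
        return 0.0
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_age_years(ltr_a, ltr_b, rate_per_site_per_year):
    """T = K / (2r): both LTRs accumulate substitutions independently."""
    return jukes_cantor(p_distance(ltr_a, ltr_b)) / (2 * rate_per_site_per_year)


if __name__ == "__main__":
    left = "ACGTACGTAC"
    right = "ACGTACGTAT"   # one difference in ten sites
    illustrative_rate = 1e-8  # hypothetical rate, for demonstration only
    print(f"{insertion_age_years(left, right, illustrative_rate):.3g} years")
```

The factor of two matters: dropping it would double every estimated age.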
24. Norway spruce genome: the state of things
• ~80% of genome assembled
– still highly fragmented ... repeats!
• Fosmid pool / long fragment strategy will work
– necessary for spanning repeats and filling gaps
– requires improvements in both sequencing and software technologies
• Biology
– Allele splitting : improved assembly technology needed
– ~70% repeats, but ~none of the TEs are active
– Gene and genome structure still being unravelled
– Comparisons against 5 other conifers
25. The Spruce Genome Team
UPSC: Rishikesh Bhalerao, Simon Birve, Ulrika Egertsdotter, Ioana Gaboreanu, Rosario Garcia-Gil, Per Gardeström, Thomas Hiltonen, Torgeir Hvidsten, Pär Ingvarsson, Stefan Jansson, Olivier Keech, Susanne Larsson, Chanaka Mannapperuma, Ove Nilsson, Douglas Scofield, Nathaniel Street, Björn Sundberg, Stacey Lee Thompson, Zhi-Qiang Wu, Harry Wu
SciLifeLab: Andrey Alexeyenko, Björn Andersson, Siv Andersson, Lars Arvestad, Frida Berglund, Oscar Franzén, Manfred Grabherr, Kicki Holmberg, Lisa Klasson, Max Käller, Joakim Lundeberg, Fredrik Lysholm, Björn Nystedt, Kristoffer Sahlin, Ellen Sherwood, Anna Sköllermo, Anne-Charlotte Sonnhammer, Thomas Svensson, Carlos Talavera-Lopez, Anna Wetterbom
VIB Gent: Yves Van de Peer, Yao-Cheng Lin
IGA Udine: Michele Morgante, Francesco Vezzi, Riccardo Vicedomini, Andrea Zuccolo
CHORI Oakland: Pieter de Jong, Maxim Koriabine
SAB: Kerstin Lindblad-Toh, John MacKay, Outi Savolainen, Detlef Weigel
Skogforsk: Bengt Andersson, Bo Karlsson
SNIC Supercomputers: Uppmax/PDC/NSC/HPC2N
SNISS national infrastructure
CLCbio
Lucigen
27. Data processing outline
Processing steps (tools in parentheses):
• Raw data: ~6 Tbp
• Quality filtered data (rNA, Fastx, FastQC)
• Remove phiX (+ chloroplast) (BWA, rNA, FastQC)
• De novo assembly (CLC, (Velvet, Newbler))
• Merging of assemblies (GAM [custom tool])
• Scaffolding (BESST [custom tool])
• Assembly validation (FRC [custom toolkit])
• Repeat annotation (RepeatMasker, RepeatScout, BLAST, custom tools)
• Gene annotation (EUGene)
• Genome portal
• Transcriptome sequencing
550 libs
20-30 different lib types
Aims (Phase 1)
• Public genome resource
• Genes (and gene families)
• Repeats
• Evolutionary insight
Editor's Notes
Cumbersome pipeline -> rNA much easier! Try to stay with standard software; it saves us time and resources. One tool that we couldn't do without. Any guesses?