
Assembling the Norway Spruce Genome: 20Gb and many challenges, Umeå Plant Sciences Centre, Umeå University, Douglas G. Scofield Copenhagenomics 2012


Dr. Douglas G. Scofield is Principal Research Engineer at Umeå Plant Science Centre at Umeå University in Sweden. Slides for his presentation: Assembling the Norway Spruce genome.

Published in: Education, Technology
  • Cumbersome pipeline -> rNA much easier! Try to stay with standard software: it saves time and resources for us. One tool that we couldn't do without. Any guesses?
  • Assembling the Norway Spruce Genome: 20Gb and many challenges, Umeå Plant Sciences Centre, Umeå University, Douglas G. Scofield Copenhagenomics 2012

    1. Assembling the Norway Spruce genome: 20 Gb and many challenges. Douglas G. Scofield, Umeå Plant Science Centre, Umeå University
    2. Picea abies: A Very Large Genome
        • 20 Gb, nearly 7x human
            – typical for conifers
        • Little known about gene and genome structure
        • Only ~0.1 % consists of genes
        • Largest genome sequenced so far
        • Other ongoing conifer sequencing projects
            – Picea glauca (Canada)
            – Pinus taeda (US)
        • Genome sizes: Arabidopsis (120 Mbp), Poplar (450 Mbp), Humans (3 000 Mbp), Norway spruce (20 000 Mbp)
    3. Sequencing overview
        • Tissue: 2n needle tissue; 1n megagametophytes
        • Illumina paired-end
            – 150-bp insert
            – 300-bp insert
            – 650-bp insert
        • 454 single-end (1.5x)
        • Mate-pair libraries
            – 2kbp insert
            – 4kbp insert
            – 10kbp insert
            – Fosmid ends (40kbp)
        • ~50x input coverage
        • 454 and Illumina transcriptomes
        • Three megagametophytes from the reference tree, ~30-35x input coverage for each
        • plus, fosmid pools…
    4. Fosmid pools: the plan
        • Fosmid library: ~40kbp / fosmid
        • ~1000 fosmids / pool = ~40 Mbp @ 1n, much simpler assembly
        • 8 pools / HiSeq lane = 60x coverage with 300-bp PE reads
        • 500 pools = 1x genome coverage
    5. Fosmid pools: the reality
        • 40% reads lost to E. coli, and 20-50% mitochondrial
        • ~0.5x realized genome coverage
        • BUT … low redundancy among pools, and assembly is easier
        • [Figure: WGS assembly (Mbp) and % recovered vs. cumulated assembly of fosmid pools (Mbp); series >5kbp, >10kbp, >20kbp]
    6. Putting it all together
        • [Diagram labels: WGS (1n), CLC; Fosmid pool, CLC + BESST; Merged; Final]
        • GAM: Genomic Assemblies Merger (Casagrande et al.)
        • BESST: fast, lightweight scaffolder developed at SciLifeLab
        • Multiple rounds of GAM, scaffolding with transcripts, …
    7. Assembly quality
        • >80% of genome in scaffolds and contigs
        • 14% total genome length in scaffolds >10Kb (vs. 1%)
        • N50 8800 bp (vs. 2900 bp)
        • Longest contig 1.15 Mb (vs. 206 Kb)
        • Feature Response Curves: quantifying quality using configurations of mapped paired-end reads
        • [Figure axis: Accumulated length of contigs]
    8. Now for a little spruce biology…
        • Genetic diversity
        • Low gene content
        • Allele splitting
        • Repeat content
    9. Genetic diversity
        • ε = 2.46 × 10⁻³ errors/read-base
            – sequencing error + misaligned reads
        • Θ = 9.47 × 10⁻³ ± 0.007 × 10⁻³
            – scaled mutation rate, Θ = 4Neμ
        • Ne ≈ 1.0 – 1.7 × 10⁶
            – using LTR-based estimates of μ
        • Ho ≈ 0.27%
            – 1.42 Gbp examined
        • [Figure: Zygosity correlation; Δ, zygosity correlation vs. distance between sites (bp)]
        • [Figure: Heterozygous sites; count (Mb) by IUPAC code (genotype): W (AT), S (CG), K (GT), M (AC), R (AG), Y (CT)]
    10. Low gene content
    11. Random promoter sequences…
        • 12 bp: 730 locations
        • 8 bp: ~220,000 locations
    12. Frequent aberrant transcription?
    13. Chromatin structure strongly controlled?
    14. Genes clustered?
    15. Are there inefficiencies in transcript processing?
    16. Allele splitting
        • [Figure panels: Median coverage of 2n contigs; Median coverage of 1n contigs]
    17. Identifying allele splitting: self-self blasts
    18. Repeat content in contig ends vs. middles
        • 20-mers appearing in 100-bp segments from 100K contigs
        • Gepard: http://www.helmholtz-muenchen.de/en/mips/services/analysis-tools/gepard/index.html
    19. The repeat landscape in Picea abies (so far)
        • Transposable elements! (values in %)
            – LINE: 0.44
            – LTR copia: 13.69
            – LTR gypsy: 29.61
            – LTR uncl.: 15.22
            – DNA_TEs: 0.57
            – Unclassified: 9.24
            – Total: 68.77
        • 3 human genomes of LTRs!
        • [Figure label: Ty1-copia RT]
    20. Age of LTR insertions
        • [Figure: time back to LTR insertion (MYA); labels: 4 Mya, 85 Mya]
    21. Norway spruce genome: the state of things
        • ~80% of genome assembled
            – still highly fragmented ... repeats!
        • Fosmid pool / long fragment strategy will work
            – necessary for spanning repeats and filling gaps
            – requires improvements in both sequencing and software technologies
        • Biology
            – Allele splitting: improved assembly technology needed
            – ~70% repeats, but ~none of the TEs are active
            – Gene and genome structure still being unravelled
            – Comparisons against 5 other conifers
    22. The Spruce Genome Team
        • UPSC: Rishikesh Bhalerao, Simon Birve, Ulrika Egertsdotter, Ioana Gaboreanu, Rosario Garcia-Gil, Per Gardeström, Thomas Hiltonen, Torgeir Hvidsten, Pär Ingvarsson, Stefan Jansson, Olivier Keech, Susanne Larsson, Chanaka Mannapperuma, Ove Nilsson, Douglas Scofield, Nathaniel Street, Björn Sundberg, Stacey Lee Thompson, Zhi-Qiang Wu, Harry Wu
        • SciLifeLab: Andrey Alexeyenko, Björn Andersson, Siv Andersson, Lars Arvestad, Frida Berglund, Oscar Franzén, Manfred Grabherr, Kicki Holmberg, Lisa Klasson, Max Käller, Joakim Lundeberg, Fredrik Lysholm, Björn Nystedt, Kristoffer Sahlin, Ellen Sherwood, Anna Sköllermo, Anne-Charlotte Sonnhammer, Thomas Svensson, Carlos Talavera-Lopez, Anna Wetterbom
        • VIB Gent: Yves Van de Peer, Yao-Cheng Lin
        • IGA Udine: Michele Morgante, Francesco Vezzi, Riccardo Vicedomini, Andrea Zuccolo
        • CHORI Oakland: Pieter de Jong, Maxim Koriabine
        • SAB: Kerstin Lindblad-Toh, John MacKay, Outi Savolainen, Detlef Weigel
        • Skogforsk: Bengt Andersson, Bo Karlsson
        • SNIC Supercomputers: Uppmax/PDC/NSC/HPC2N
        • SNISS national infrastructure
        • CLCbio
        • Lucigen
    23. The spruce. Thanks for listening!
    24. Data processing outline
        • Raw data: ~6 Tbp
        • Quality filtered data: rNA, Fastx, FastQC
        • Remove phiX (+ chloroplast): BWA, rNA, FastQC
        • De novo assembly: CLC, (Velvet, Newbler)
        • Scaffolding: BESST [custom tool]
        • Merging of assemblies: GAM [custom tool]
        • Assembly validation: FRC [custom toolkit]
        • Repeat annotation: RepeatMasker, RepeatScout, BLAST, custom tools
        • Gene annotation: EUGene
        • Genome portal
        • Transcriptome sequencing: 550 libs, 20-30 different lib types
        • Aims (Phase 1)
            – Public genome resource
            – Genes (and gene families)
            – Repeats
            – Evolutionary insight
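
The pool arithmetic on slide 4 (fosmid pools: the plan) can be checked with a few lines of Python. This is only a sketch using the rounded figures quoted on the slide; the variable names are illustrative, and the per-lane yield is a value derived here from those figures, not a number given in the talk.

```python
# Rounded planning figures taken from slide 4; names are illustrative.
FOSMID_BP = 40_000                 # ~40 kbp per fosmid
FOSMIDS_PER_POOL = 1_000           # ~1000 fosmids per pool
POOLS = 500                        # planned number of pools
GENOME_BP = 20_000_000_000         # ~20 Gb Norway spruce genome
POOLS_PER_LANE = 8                 # pools sequenced per HiSeq lane
COVERAGE_PER_POOL = 60             # planned coverage of each pool

# Sequence content of one pool (haploid, ~40 Mbp).
pool_bp = FOSMID_BP * FOSMIDS_PER_POOL

# Genome coverage if all pools were non-overlapping and fully recovered.
genome_coverage = POOLS * pool_bp / GENOME_BP

# Derived (not stated on the slide): lane yield implied by 8 pools at 60x.
lane_yield_bp = POOLS_PER_LANE * pool_bp * COVERAGE_PER_POOL

print(f"pool size: {pool_bp / 1e6:.0f} Mbp")
print(f"planned genome coverage: {genome_coverage:.1f}x")
print(f"implied lane yield: {lane_yield_bp / 1e9:.1f} Gbp")
```

With these inputs the plan works out to 40 Mbp per pool and 1x genome coverage, which is the figure slide 5 then contrasts with the ~0.5x actually realized.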

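
Slide 7 reports N50 as an assembly quality metric. For readers unfamiliar with it, here is a minimal sketch of the usual definition: the length L such that sequences of length L or longer hold at least half of the total assembled bases. The function name and the toy lengths are made up for illustration, and the talk does not say whether its N50 was computed over contigs or scaffolds.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, not spruce data):
# total = 30, and 10 + 8 = 18 is the first running sum to reach half of it.
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))
```

Note that this measures N50 against the assembled total, not the estimated genome size, so a fragmented assembly that covers only part of the genome can still report a respectable N50.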
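
Slide 9 relates the scaled mutation rate to effective population size through Θ = 4Neμ. The talk derived Ne from Θ and LTR-based estimates of μ; the sketch below simply runs the same relation in reverse to show what range of μ the quoted numbers imply. The function name is illustrative, and reading μ as a per-site, per-generation rate is an assumption, since the slide does not state units.

```python
def implied_mutation_rate(theta, ne):
    """Solve theta = 4 * Ne * mu for mu (assumed per site per generation)."""
    return theta / (4.0 * ne)


THETA = 9.47e-3                  # scaled mutation rate from slide 9
NE_LOW, NE_HIGH = 1.0e6, 1.7e6   # effective population size range from slide 9

# A larger Ne implies a smaller mu for the same theta.
mu_high = implied_mutation_rate(THETA, NE_LOW)
mu_low = implied_mutation_rate(THETA, NE_HIGH)

print(f"implied mu range: {mu_low:.2e} to {mu_high:.2e}")
```

Under these assumptions the implied μ falls between roughly 1.4 × 10⁻⁹ and 2.4 × 10⁻⁹.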