Assembling the Norway Spruce Genome: 20Gb and many challenges, Umeå Plant Sciences Centre, Umeå University, Douglas G. Scofield Copenhagenomics 2012
Dr. Douglas G. Scofield is Principal Research Engineer at Umeå Plant Science Centre at Umeå University in Sweden. Slides for his presentation: Assembling the Norway Spruce genome.

Published in: Education, Technology
  • Cumbersome pipeline -> rNA much easier! Try to stay with standard software. Saves time and resources for us. One tool that we couldn't do without. Any guesses?
  • Transcript

    • 1. Assembling the Norway Spruce genome: 20 Gb and many challenges
      • Douglas G. Scofield
      • Umeå Plant Science Centre, Umeå University
    • 2. Picea abies: A Very Large Genome
      • 20 Gb, nearly 7x human
        • typical for conifers
      • Little known about gene and genome structure
      • Only ~0.1 % consists of genes
      • Largest genome sequenced so far
      • Other ongoing conifer sequencing projects
        • Picea glauca (Canada)
        • Pinus taeda (US)
      • [Figure: genome size comparison. Arabidopsis (120 Mbp), Poplar (450 Mbp), Humans (3 000 Mbp), Norway spruce (20 000 Mbp)]
    • 3. Sequencing overview
      • Tissue: 2n needle tissue; 1n megagametophytes
      • Illumina paired-end
        • 150-bp insert
        • 300-bp insert
        • 650-bp insert
      • Mate-pair libraries
        • 2 kbp insert
        • 4 kbp insert
        • 10 kbp insert
        • Fosmid ends (40 kbp)
      • ~50x input coverage
      • 454 single-end (1.5x)
      • 454 and Illumina transcriptomes
      • Three megagametophytes from the reference tree
        • ~30-35x input coverage for each
      • plus, fosmid pools…
    • 4. Fosmid pools: the plan
      • Fosmid library: ~40 kbp / fosmid
      • ~1000 fosmids / pool = ~40 Mbp @ 1n, much simpler assembly
      • 8 pools / HiSeq lane = 60x coverage with 300-bp PE reads
      • 500 pools = 1x genome coverage
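The plan on slide 4 is simple arithmetic, sketched below in Python. The fosmid size, pool size, pools per lane, and coverage target come from the slide, and the 20 Gbp genome size from slide 2. The per-lane yield is not stated anywhere in the slides; it is only what the 60x figure would imply, so treat it as a derived assumption.

```python
# Back-of-envelope check of the fosmid pool plan (slide 4).
FOSMID_BP = 40_000           # ~40 kbp per fosmid (slide 4)
FOSMIDS_PER_POOL = 1_000     # ~1000 fosmids per pool (slide 4)
GENOME_BP = 20_000_000_000   # ~20 Gbp genome (slide 2)
POOLS_PER_LANE = 8           # slide 4
TARGET_COVERAGE = 60         # 60x per pool (slide 4)

# Sequence content of one pool if no fosmids overlap.
pool_bp = FOSMID_BP * FOSMIDS_PER_POOL

# Number of pools whose combined length equals one genome.
pools_for_1x = GENOME_BP / pool_bp

# Lane yield implied by the 60x target (an inference, not a slide value).
implied_lane_yield_bp = pool_bp * POOLS_PER_LANE * TARGET_COVERAGE

print(f"pool size: {pool_bp / 1e6:.0f} Mbp")
print(f"pools for 1x genome coverage: {pools_for_1x:.0f}")
print(f"implied lane yield: {implied_lane_yield_bp / 1e9:.1f} Gbp")
```

Note that "1x genome coverage" here means total fosmid length equal to the genome length; with random clone sampling, part of the genome would still be missed.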
    • 5. Fosmid pools: the reality
      • 40% reads lost to E. coli, and 20-50% mitochondrial
      • ~0.5x realized genome coverage
      • BUT … low redundancy among pools, and assembly is easier
      • [Figure: WGS assembly (Mbp) against cumulated assembly of fosmid pools (Mbp), with % recovered for contigs >5 kbp, >10 kbp, and >20 kbp]
    • 6. Putting it all together
      • [Diagram labels: WGS, CLC (1n), Fosmid pool, CLC + BESST, GAM, Merged, BESST, Final]
      • GAM: Genomic Assemblies Merger (Casagrande et al.)
      • BESST: fast, lightweight scaffolder developed at SciLifeLab
      • Multiple rounds of GAM, scaffolding with transcripts, …
    • 7. Assembly quality
      • >80% of genome in scaffolds and contigs
      • 14% total genome length in scaffolds >10 Kb (vs. 1%)
      • N50 8800 bp (vs. 2900 bp)
      • Longest contig 1.15 Mb (vs. 206 Kb)
      • Feature Response Curves: quantifying quality using configurations of mapped paired-end reads
      • [Figure axis: Accumulated length of contigs]
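For readers unfamiliar with the N50 statistic quoted on slide 7, here is a minimal Python sketch of the usual definition: the length L such that sequences of length L or longer contain at least half of the total assembled bases. The slide does not say whether its N50 was computed over contigs or scaffolds, or relative to the assembly or the genome size, so the function below (a hypothetical helper, not the project's code) uses the plain assembly-based definition.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches half of the summed length; the length of
    the sequence that crosses that point is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no lengths given")


# Toy example: total is 20, so N50 is the length at which the running
# sum first reaches 10 (6 + 5 = 11).
print(n50([2, 3, 4, 5, 6]))
```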
    • 8. Now for a little spruce biology…
      • Genetic diversity
      • Low gene content
      • Allele splitting
      • Repeat content
    • 9. Genetic diversity
      • ε = 2.46 × 10⁻³ errors/read-base
        • sequencing error + misaligned reads
      • Θ = 9.47 × 10⁻³ ± 0.007 × 10⁻³
        • scaled mutation rate, Θ = 4Neμ
      • Ne ≈ 1.0 – 1.7 × 10⁶
        • using LTR-based estimates of μ
      • Ho ≈ 0.27%
        • 1.42 Gbp examined
      • [Figure: Zygosity correlation. Δ, zygosity correlation, against distance between sites (bp)]
      • [Figure: Heterozygous sites. Count (Mb) by IUPAC code (genotype): W (AT), S (CG), K (GT), M (AC), R (AG), Y (CT)]
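Slide 9 relates the scaled mutation rate to effective population size through Θ = 4Neμ. The Python sketch below rearranges that relation in both directions. Θ and the Ne range are taken from the slide; the mutation rates it prints are only the values implied by those two numbers, since the LTR-based estimates of μ themselves are not given in the slides.

```python
THETA = 9.47e-3  # scaled mutation rate, from slide 9


def effective_size(theta, mu):
    """Ne from theta = 4 * Ne * mu."""
    return theta / (4 * mu)


def implied_mu(theta, ne):
    """Mutation rate implied by theta and an effective population size."""
    return theta / (4 * ne)


# Ne range from slide 9; the resulting mu values are derived, not reported.
for ne in (1.0e6, 1.7e6):
    mu = implied_mu(THETA, ne)
    print(f"Ne = {ne:.1e} -> implied mu = {mu:.2e}")
```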
    • 10. Low gene content
    • 11. Random promoter sequences…
      • 12 bp: 730 locations
      • 8 bp: ~220,000 locations
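Slide 11 does not say how its location counts were obtained. As a rough point of comparison only, the Python sketch below computes how often one specific k-mer would be expected to occur by chance on a single strand of a sequence with uniform, independent base composition. Both the uniform model and the use of the 20 Gbp genome size from slide 2 are assumptions made here for illustration, so the results are not expected to match the slide's figures exactly.

```python
def expected_hits(sequence_bp, k):
    """Expected occurrences of one specific k-mer on a single strand.

    Assumes each base is drawn independently and uniformly from A, C, G, T,
    and ignores edge effects (sequence_bp is much larger than k).
    """
    return sequence_bp / 4 ** k


GENOME_BP = 20e9  # ~20 Gbp (slide 2); assumption for this illustration

for k in (12, 8):
    print(f"{k}-mer: ~{expected_hits(GENOME_BP, k):,.0f} chance locations")
```

Under these assumptions the chance counts come out at the same order of magnitude as the numbers on the slide, which is the point the slide title makes: short motifs occur many times in a genome this large.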
    • 12. Frequent aberrant transcription?
    • 13. Chromatin structure strongly controlled?
    • 14. Genes clustered?
    • 15. Are there inefficiencies in transcript processing?
    • 16. Allele splitting
      • [Figure panels: Median coverage of 2n contigs; Median coverage of 1n contigs]
    • 17. Identifying allele splitting: self-self blasts
    • 18. Repeat content in contig ends vs. middles
      • 20-mers appearing in 100-bp segments from 100K contigs
      • Gepard: http://www.helmholtz-muenchen.de/en/mips/services/analysis-tools/gepard/index.html
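One way to read the comparison on slide 18 is: take a 100-bp segment from each contig end and one from the middle, then ask what fraction of 20-mers in each set occurs more than once. The Python sketch below is that interpretation only, under stated assumptions; it is not the analysis behind the slide and not what Gepard (a dot-plot tool) does. The function names are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k=20):
    """All overlapping k-mers of seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def segments(contig, seg_len=100):
    """Return ([left_end, right_end], middle) segments of one contig.

    Contigs shorter than three segments are skipped so that the end and
    middle segments never overlap.
    """
    if len(contig) < 3 * seg_len:
        return [], None
    mid_start = (len(contig) - seg_len) // 2
    ends = [contig[:seg_len], contig[-seg_len:]]
    return ends, contig[mid_start:mid_start + seg_len]


def repeated_fraction(segs, k=20):
    """Fraction of k-mer occurrences whose k-mer is seen more than once."""
    counts = Counter(km for s in segs for km in kmers(s, k))
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total if total else 0.0


def compare_ends_and_middles(contigs, seg_len=100, k=20):
    """Repeated k-mer fraction for end segments and for middle segments."""
    end_segs, mid_segs = [], []
    for contig in contigs:
        ends, mid = segments(contig, seg_len)
        end_segs.extend(ends)
        if mid is not None:
            mid_segs.append(mid)
    return repeated_fraction(end_segs, k), repeated_fraction(mid_segs, k)
```

A higher repeated fraction in the end segments than in the middles would be consistent with contigs breaking at repeats, which is the conclusion slide 21 draws.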
    • 19. The repeat landscape in Picea abies (so far)
      • Transposable elements! (%)
        • LINE: 0.44
        • LTR copia: 13.69
        • LTR gypsy: 29.61
        • LTR uncl.: 15.22
        • DNA_TEs: 0.57
        • Unclassified: 9.24
        • Total: 68.77
      • 3 human genomes of LTRs!
      • [Figure label: Ty1-copia RT]
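The class values on slide 19 can be checked against the stated total with a few lines of Python. The numbers are copied from the slide (decimal commas read as decimal points); treating them as percentages is an assumption based on the ~70% repeat figure on slide 21.

```python
# Repeat classes from slide 19; values assumed to be percentages.
repeat_classes = {
    "LINE": 0.44,
    "LTR copia": 13.69,
    "LTR gypsy": 29.61,
    "LTR uncl.": 15.22,
    "DNA_TEs": 0.57,
    "Unclassified": 9.24,
}

# Sum of all classes, to compare with the "Total 68,77" on the slide.
total = round(sum(repeat_classes.values()), 2)

# Combined share of the three LTR classes.
ltr_total = round(
    sum(v for name, v in repeat_classes.items() if name.startswith("LTR")), 2
)

print(f"sum of classes: {total}")
print(f"LTR classes combined: {ltr_total}")
```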
    • 20. Age of LTR insertions
      • [Figure: time back to LTR insertion (MYA), with 4 Mya and 85 Mya marked]
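The slides do not describe how the insertion ages on slide 20 were computed. A common approach, sketched here in Python as an assumption rather than as the authors' method, uses the fact that the two LTRs of an element are identical at insertion and then diverge independently, giving T = K / (2r), where K is the corrected divergence between the two LTRs and r is the substitution rate per site per year. The Jukes-Cantor correction and the example rate are illustrative choices, not values from the talk.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance from a proportion of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_age_years(k, rate_per_site_per_year):
    """Age of an LTR insertion from LTR-LTR divergence: T = K / (2r)."""
    return k / (2 * rate_per_site_per_year)


# Hypothetical inputs, for illustration only.
p_observed = 0.02       # 2% of aligned LTR sites differ
example_rate = 1e-9     # substitutions per site per year (assumed)

k = jc69_distance(p_observed)
print(f"K = {k:.4f}, age = {insertion_age_years(k, example_rate) / 1e6:.1f} My")
```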
    • 21. Norway spruce genome: the state of things
      • ~80% of genome assembled
        • still highly fragmented ... repeats!
      • Fosmid pool / long fragment strategy will work
        • necessary for spanning repeats and filling gaps
        • requires improvements in both sequencing and software technologies
      • Biology
        • Allele splitting: improved assembly technology needed
        • ~70% repeats, but ~none of the TEs are active
        • Gene and genome structure still being unravelled
        • Comparisons against 5 other conifers
    • 22. The Spruce Genome Team
      • UPSC: Rishikesh Bhalerao, Simon Birve, Ulrika Egertsdotter, Ioana Gaboreanu, Rosario Garcia-Gil, Per Gardeström, Thomas Hiltonen, Torgeir Hvidsten, Pär Ingvarsson, Stefan Jansson, Olivier Keech, Susanne Larsson, Chanaka Mannapperuma, Ove Nilsson, Douglas Scofield, Nathaniel Street, Björn Sundberg, Stacey Lee Thompson, Zhi-Qiang Wu, Harry Wu
      • SciLifeLab: Andrey Alexeyenko, Björn Andersson, Siv Andersson, Lars Arvestad, Frida Berglund, Oscar Franzén, Manfred Grabherr, Kicki Holmberg, Lisa Klasson, Max Käller, Joakim Lundeberg, Fredrik Lysholm, Björn Nystedt, Kristoffer Sahlin, Ellen Sherwood, Anna Sköllermo, Anne-Charlotte Sonnhammer, Thomas Svensson, Carlos Talavera-Lopez, Anna Wetterbom
      • VIB Gent: Yves Van de Peer, Yao-Cheng Lin
      • IGA Udine: Michele Morgante, Francesco Vezzi, Riccardo Vicedomini, Andrea Zuccolo
      • CHORI Oakland: Pieter de Jong, Maxim Koriabine
      • SAB: Kerstin Lindblad-Toh, John MacKay, Outi Savolainen, Detlef Weigel
      • Skogforsk: Bengt Andersson, Bo Karlsson
      • SNIC Supercomputers: Uppmax/PDC/NSC/HPC2N
      • SNISS national infrastructure
      • CLCbio
      • Lucigen
    • 23. The spruce. Thanks for listening!
    • 24. Data processing outline
      • Raw data: ~6 Tbp
      • Quality filtered data: rNA, Fastx, FastQC
      • Remove phiX (+ chloroplast): BWA, rNA, FastQC
      • De novo assembly: CLC, (Velvet, Newbler)
      • Scaffolding: BESST [custom tool]
      • Merging of assemblies: GAM [custom tool]
      • Assembly validation: FRC [custom toolkit]
      • Repeat annotation: RepeatMasker, RepeatScout, BLAST, custom tools
      • Gene annotation: EUGene
      • Transcriptome sequencing: 550 libs, 20-30 different lib types
      • Genome portal
      • Aims (Phase 1)
        • Public genome resource
        • Genes (and gene families)
        • Repeats
        • Evolutionary insight
