
2013 pag-poultry-workshop



  1. 1. Evaluating and improving the chick genome & transcriptome. C. Titus Brown, Asst Prof, CSE and Microbiology; BEACON NSF STC; Michigan State University
  2. 2. Acknowledgements: This is joint work with Hans Cheng (USDA ADOL) and Jerry Dodgson (MSU). Likit Preeyanon (MSU) and Alexis Black Pyrkosz (ADOL) did the work. All of the software discussed in this talk is available. This work was primarily supported by the USDA NIFA through a grant to me.
  3. 3. Simulations show that incomplete gene reference => inaccurate differential expression from mRNAseq. [Figure: two panels, "Single End Reads" and "Paired End Reads"; x-axis: % Reference Completeness; y-axis: % Transcripts Expressed Inaccurately (2-fold Difference); curves for 25%, 50%, 75%, and 100% expression.] Alexis Black Pyrkosz
  4. 4. Existing chick gene models lack exons, isoforms. [Figure tracks: Our data; Models.] *This gene contains at least 4 isoforms. Likit Preeyanon
  5. 5. (Exon detection is pretty good.) Likit Preeyanon
  6. 6. Different approaches to gene set prediction yield distinct splice junction predictions. > 95% of the assembly-based splice junctions are supported by 4 or more independent reads. Likit Preeyanon
  7. 7. mRNAseq analysis with a combined de novo and genome-based approach. Likit Preeyanon
  8. 8. We can produce combined gene models. Cufflinks (ref based) + de novo assembly + known mRNA
  9. 9. Gene Model Summary (note: spleen mRNAseq)
       Method                     Genes     Transcripts
       Global Assembly            14,832    32,311
       Local Assembly             15,297    23,028
       Global + Local Assembly    15,934    46,797
     *Number of genes and transcripts might be overestimated due to incomplete assembly and spurious splice junctions.
  10. 10. Cross-validation with technical replicates
        Dataset              Single-end mapped      Single-end unmapped    Paired-end mapped      Paired-end unmapped
        Line 6 uninfected    18,375,966 (77.93%)    5,203,586 (22.07%)     21,598,218 (64.16%)    12,065,659 (35.84%)
        Line 6 infected      17,160,695 (73.18%)    6,288,286 (26.82%)     15,274,638 (63.89%)    8,633,855 (36.11%)
        Line 7 uninfected    18,130,072 (75.77%)    5,795,737 (24.22%)     20,961,033 (63.67%)    11,960,299 (36.33%)
        Line 7 infected      19,912,046 (78.51%)    5,450,521 (21.49%)     22,485,833 (65.22%)    11,992,002 (34.78%)
      Single-end reads were used to generate gene models; paired-end data were used as a technical replicate for cross-validation.
  11. 11. Gene Modeler Pipeline (“gimme”): Merge transcripts together based on transcript mapping to genome; can include existing gene predictions and iteratively combine predictions. Steps: construct gene models; remove redundant sequences; predict strands and ORFs. Likit Preeyanon
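A rough illustration of the "merge transcripts based on mapping to genome" idea, as a minimal sketch only: it groups transcripts whose genomic alignments overlap on the same chromosome into candidate loci. This is not gimme's actual algorithm, and the function name `cluster_transcripts` and the `(name, chrom, start, end)` tuple layout are assumptions made for this example.

```python
def cluster_transcripts(transcripts):
    """Group transcripts into candidate loci by overlapping genomic span.

    transcripts: iterable of (name, chrom, start, end) tuples (hypothetical
    layout for this sketch; coordinates are inclusive).
    Returns a list of dicts with keys chrom, start, end, members.
    """
    loci = []
    # Sort by chromosome, then start, so overlapping spans are adjacent.
    for name, chrom, start, end in sorted(transcripts, key=lambda t: (t[1], t[2], t[3])):
        last = loci[-1] if loci else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            # Overlaps the current locus: extend it and record the member.
            last["end"] = max(last["end"], end)
            last["members"].append(name)
        else:
            # No overlap: start a new locus.
            loci.append({"chrom": chrom, "start": start, "end": end, "members": [name]})
    return loci


example = [
    ("cufflinks_1", "chr1", 100, 500),
    ("denovo_7", "chr1", 400, 900),
    ("mrna_3", "chr1", 2000, 2500),
    ("denovo_9", "chr2", 100, 300),
]
loci = cluster_transcripts(example)
```

A real gene modeler would also compare exon structure and strand, which this span-only sketch ignores.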
  12. 12. Next problem: chick reference! We like using the reference genome to scaffold RNAseq contigs; purely de novo RNAseq assembly is messy. Genomes are also useful for other things, we hear. Problems: (a) Poor sensitivity: the chick genome is missing a substantial number of genes from microchromosomes; 723 genes from HSA19q are missing from chicken galGal4, with ESTs and RNAseq transcripts for many or most. (b) Gaps: 9,900 gaps on ordered chromosomes; 21k gaps on chr-aligned but low-confidence/unaligned sequence. (c) Over-collapsed tandem duplications and under-collapsed heterozygous regions.
  13. 13. Sensitivity: where is the problem? Are microchromosomes hard to sequence, or is microchromosomal sequence hard to assemble? Sequences that simply don’t show up in the data are hard to include in the assembly… (unclonable (Sanger); strong GC or AT bias). Sequences with biased (generally low) coverage are often discarded by assemblers.
  14. 14. Can we “even out” coverage? (Digital normalization) If you have two loci, or two mRNA species, with uneven coverage, can you remove the extra coverage?
  15. 15. Coverage before digital normalization: (MD amplified)
  16. 16. Coverage after digital normalization: normalizes coverage; discards redundancy; eliminates majority of errors; scales assembly dramatically. Assembly is 98% identical.
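The slides do not spell out the procedure, so the following is a minimal sketch of the median k-mer count rule commonly described for digital normalization: a read is kept only if the median count of its k-mers, among reads kept so far, is still below a cutoff. The names `digital_normalize`, `k`, and `cutoff` are illustrative assumptions, and the exact `Counter` used here stands in for the memory-efficient probabilistic counting a production tool would need.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below cutoff.

    Coverage is estimated as the median count of the read's k-mers
    among the reads that have been kept so far.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read too short to evaluate
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to counts
    return kept


# Toy input: one highly covered sequence and one rare sequence.
abundant = "ACGTTGCAAG"
rare = "GGATCCTTAA"
kept = digital_normalize([abundant] * 10 + [rare] * 2, k=4, cutoff=3)
```

In this toy input the abundant sequence is capped near the cutoff while the rare sequence is retained in full, which is the "remove the extra coverage" behaviour asked about above.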
  17. 17. Prelim results from digital normalization: Reassembled chick genome contigs from 70x Illumina -> normalized reads in ~24 hours. Obtained 40 Mbp of assembled contigs that were not present in galGal4. Contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick spleen mRNAseq. ⇒ Bioinformatics remedies may help but are probably not sufficient. Likit Preeyanon
  18. 18. Can we improve the assembly?
      Read cleaning and improvement:
        1. Digital normalization evens out relative coverage, permitting recovery of difficult-to-sequence regions in assemblies.
        2. Error correction and read-to-graph concordance editing collapses heterozygous regions.
        3. Paired-end de Bruijn graphs can be used to include long-distance constraints in primary contig assembly.
        4. RNAseq data indicates contigs that can be combined into scaffolds.
      Assembly assessment:
        1. High-abundance k-mers present in the sequence data but missing from the assembly indicate poor sensitivity.
        2. Discordant long-insert mate pairs indicate potentially erroneous contigs and/or scaffolds.
        3. De novo RNAseq assembly can identify likely misassemblies and positively identify missing genomic sequence.
      [Diagram stages: Read cleaning and improvement; Selection of strategies and parameters; Contig assembly and scaffolding; Assembly assessment.]
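The first assembly-assessment check (high-abundance k-mers in the reads that are absent from the assembly) can be sketched as below. This is an illustrative, in-memory version under stated assumptions: the function names, `k`, and `min_count` are hypothetical, sequences are assumed to contain only A, C, G, and T, and k-mers are compared in canonical form so that strand does not matter.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(seq, k):
    """Yield the canonical form of every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def missing_abundant_kmers(reads, contigs, k=21, min_count=5):
    """Return {k-mer: read count} for abundant read k-mers absent from the contigs."""
    read_counts = Counter(km for read in reads for km in canonical_kmers(read, k))
    assembly_kmers = {km for contig in contigs for km in canonical_kmers(contig, k)}
    return {
        km: n
        for km, n in read_counts.items()
        if n >= min_count and km not in assembly_kmers
    }


# Toy input: two sequences are well covered in the reads, only one is assembled.
reads = ["ACGTTGCAAG"] * 6 + ["GGATCCTTAA"] * 6
contigs = ["ACGTTGCAAG"]
missing = missing_abundant_kmers(reads, contigs, k=4, min_count=5)
```

A large set of missing high-count k-mers would point at sequence that is present in the data but was dropped by the assembler, which is the sensitivity problem described above.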
  19. 19. Longer reads! Long reads can span repeats and heterozygous regions. [Figure labels: Repeat copy 1; Repeat copy 2; Contig 1; Polymorphic contig 2; Polymorphic contig 3; Contig 4.] slides from ; Lex Nederbragt
  20. 20. PacBio: first results (cod/salmon). Raw reads. slides from ; Lex Nederbragt
  21. 21. Cod: PacBio results. Mapping to the published genome: 11.4 kbp subread; 10.6 kbp subread; 10.9 kbp subread. slides from ; Lex Nederbragt
  22. 22. Need to combine Illumina + PacBio still. P_errorCorrection pipeline from ; 93% of reads recovered. Alignments of at least 1kb to cod published assembly. [Figure labels: Raw reads; Error-corrected reads; 2.7x; 23x.] 24 cpus, 4.5 days, 100 Gb RAM. slides from ; Lex
  23. 23. Concluding thoughts/comments: Gene models and reference genome both need work. This is going to be a continuing process… Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), and Jerry Dodgson (MSU), we are proposing to apply PacBio sequencing and digital normalization to improve the chick genome and regularly integrate community improvements; this should be a generalizable approach. Questions? Contact me at: