
High quality arthropod genome assembly with single molecule reads and long-range scaffolding



The Asian citrus psyllid (Diaphorina citri Kuwayama) is the insect vector of the bacterium Candidatus Liberibacter asiaticus (CLas), the causal agent of citrus greening, or Huanglongbing disease, which threatens the citrus industry worldwide. This vector is the primary target of approaches to stop the spread of the pathogen.
BUSCO analysis of the current genome shows that a significant proportion (25%) of the 3,350 single-copy markers conserved in hemipterans is missing, with only 74% present as full-length copies. Manual genome annotation also identified a number of misassemblies and missing genes in the current genome. This is due in part to the complexity of assembling a heterogeneous sample containing DNA from multiple psyllids, potentially exacerbated by the use of short reads. To improve the quality of the genome assembly, we have generated 36.2 Gb of PacBio long reads from 41 SMRT cells, giving 80X coverage of the 400-450 Mb genome. The Canu assembler was used to create an interim assembly (Diaci v1.9) with a contig N50 of 115.8 kb and 8,300 contigs. We are employing Dovetail Chicago libraries and a 10X Illumina library generated from a single psyllid, in conjunction with Bionano optical maps, to achieve long-range scaffolding of the genome. The final assembly will be polished with PacBio and Illumina paired-end reads, followed by scaffolding with Illumina mate-pair reads. This will be the first time all these methods have been applied to resolve an insect genome from a highly heterogeneous sample. The new assembly will be available on
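The coverage figure in the abstract follows from dividing the total sequenced bases by the genome size. The sketch below reproduces that arithmetic for both ends of the 400-450 Mb size estimate. It assumes "Gb" means 10^9 bases and "Mb" means 10^6 bases; the function name `coverage_depth` is illustrative and not taken from the slides.

```python
def coverage_depth(total_bases: float, genome_size: float) -> float:
    """Return average sequencing depth: total bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Values from the abstract: 36.2 Gb of long reads, 400-450 Mb genome estimate.
# Assumption: Gb = 1e9 bases, Mb = 1e6 bases.
total_bases = 36.2e9
for genome_size in (400e6, 450e6):
    depth = coverage_depth(total_bases, genome_size)
    print(f"{genome_size / 1e6:.0f} Mb genome: {depth:.1f}X")
```

Under these assumptions the depth falls between roughly 80X and 90X, so the quoted 80X corresponds to the upper end of the genome size estimate.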

Published in: Science

  1. High quality arthropod genome assembly with single molecule reads and long-range scaffolding
     Prashant S Hosmani (1), Mirella Flores-Gonzalez (1), Wayne Hunter (2), Lukas A. Mueller (1), Susan Brown (3), and Surya Saha (1)
     (1) Boyce Thompson Institute; (2) USDA-ARS U.S. Horticultural Research Laboratory; (3) Kansas State University
     @SahaSurya
     Entomology 2017, Advances in Arthropod Genomics Workshop
  2. Acknowledgements
     • Mueller Lab: Mirella Flores, Prashant Hosmani
     • Kansas State University: Sue Brown
     • Cornell University/BTI: Michelle (Cilia) Heck
     • USDA/ARS: Wayne Hunter, Robert Shatters
     • University of California, Davis: Carolyn Slupsky
     • Indian River State College: Tom D’elia
  3. Citrus Greening: Huanglongbing
     • Most significant disease of citrus worldwide
     • More than $4.5 billion in lost citrus production and more than 8,200 lost jobs (2006/07 to 2010/11)
     • Associated with the gram-negative bacterium Candidatus Liberibacter asiaticus (CLas)
     • Spread by an insect vector, Diaphorina citri (Asian citrus psyllid, ACP)
     Annie Kruse
  4. Omics resources and databases are required for identification of targets for interdiction
     Diagram labels: Genome; Annotation; Pathway Databases; Expression Networks; …; Target for interdiction molecules; Host, Vector, Pathogen
  5. Current Illumina assembly
     Genome:       Diaci1.1
     Contigs:      161,988
     Total length: 485 Mb
     Longest:      1 Mb
     Shortest:     201 bp
     Ns:           19.3 Mb
     Scaffold N50: 109,898 bp
     Contig N50:   34,407 bp
     • Highly fragmented
     • Many examples of misassemblies!!
  6. PacBio assembly (Koren 2017)
     Metric            | Error rate 0.013 | Error rate 0.015
     Number of contigs | 7,832            | 8,030
     Total bases       | 462.8 Mb         | 493.1 Mb
     Longest           | 1.6 Mb           | 1.7 Mb
     Shortest          | 4.4 Kb           | 5 Kb
     Average length    | 59.9 Kb          | 61.4 Kb
     Contig N50        | 85.8 Kb          | 92.6 Kb
     • Contiguous assembly with longer contigs
     • Multiple individuals in DNA sample
  7. PBJelly scaffolding (English 2012)
     Metric            | Canu assembly | Scaffolded assembly v1.9
     Number of contigs | 7,832         | 8,352
     Total bases       | 462.8 Mb      | 591.7 Mb
     Longest           | 1.6 Mb        | 2 Mb
     Shortest          | 4.4 Kb        | 1.5 Kb
     Average length    | 59 Kb         | 70.8 Kb
     Contig N50        | 85.8 Kb       | 115.8 Kb
     • 5,290 gap extensions
     • 535 gaps filled
     • Number of Ns: 0 bp
  8. Metric            | v1.91  | v1.92 REFERENCE | v1.92 ALTERNATE
     Number of contigs | 3,681  | 1,918           | 1,763
     Total bases       | 596 Mb | 513 Mb          | 83.4 Mb
     Longest           | 4.2 Mb | 4.2 Mb          | 760.6 Kb
     Shortest          | 1.5 Kb | 6 Kb            | 1.5 Kb
     Average length    | 162 Kb | 267 Kb          | 47.3 Kb
     Contig N50        | 620 Kb | 755.7 Kb        | 75.1 Kb
     Ns                | 5.1 Mb | 4.6 Mb          | 467 Kb
     500 ng input DNA from a single male psyllid
     Duplicated contigs added to alternate assembly
     Error correction
     • DNA sequencing data
     • RNA sequencing data
     • Duplication removal
     • Scaffolding
  9. Gene isoform sequencing (Iso-Seq)
     Accurate gene models are necessary for targeting assays
     • The majority of genes are alternatively spliced to produce multiple transcript isoforms.
     • Iso-Seq generates full-length cDNA sequences (full-length transcripts and gene isoforms).
     The current MCOT (de novo and genome-based) transcriptome is useful but fragmented
     Korf 2013
  10. Sequencing full-length gene isoforms
  11. Mapping to D. citri genome
      Isoforms mapped to D. citri v1.92
      Total isoforms: 314,275
      Counts
      Number of genes:                 18,799 (30,562 in MCOT)
      Number of isoforms:              61,086
      Average number of isoforms/gene: 3.24
      N50:                             2.7 Kb
      Longest:                         9 Kb
      Shortest:                        100 bp
      Isoseq provides a comprehensive (de novo and genome-based) transcriptome with full-length transcripts and a range of isoforms
  12. Evaluating the assembly
      Benchmarking sets of Universal Single-Copy Orthologs, based on a set of 3350 single-copy orthologs from hemipteran species
      Assembly   | Complete | Fragmented | Missing
      Diaci 1.1  | 74.8%    | 0.3%       | 24.9%
      Diaci 1.92 | 85.2%    | 0.1%       | 14.7%
      Paired-end RNAseq alignment
      Assembly   | Overall alignment rate | Concordant alignment rate
      Diaci 1.1  | 82%                    | 0.62%
      Diaci 1.92 | 88%                    | 60%
      Average length of aligned coding sequence
      Assembly   | MCOT    | Isoseq (full-length transcripts)
      Diaci 1.1  | 1054 bp | 470 bp
      Diaci 1.92 | 1321 bp | 699 bp
  13. Improved genome and annotation will expedite identification of targets for interdiction
      Diagram labels: Genome (PacBio v1.92); Annotation (Isoseq); Pathway Databases; Expression Networks; …; Target for interdiction molecules; Host, Vector, Pathogen
  14. Thank you!!
      Utilizing system biology resources to decipher a tritrophic disease complex
      Prashant Hosmani, Wednesday, 10:30 AM - 10:45 AM
      Member Symposium: Applying Emerging Genomic Techniques to Control Invasive Species
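
The slides above compare assemblies mainly by contig and scaffold N50. As a reminder of what that statistic measures, here is a minimal Python sketch using the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions, not values from the slides, and individual assembly tools may differ in details such as minimum-length filters.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths): total is 300, and the two longest
# contigs (80 + 70 = 150) already cover half of it.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # prints 70
```

Because N50 depends only on the length distribution, a higher value indicates a more contiguous assembly but says nothing about correctness, which is why the deck pairs it with BUSCO and RNAseq alignment checks.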