
Establishing a hybrid approach to the polyploid sugarcane genome assembly - Paul Berkman


Flowering plant genomes are among the largest and most complex known, owing to highly proliferative repetitive elements and frequent genome duplications. The sequencing revolution has now delivered over 30 plant genomes, ranging from the 82 Mbp genome of floating bladderwort to the nearly 5 Gbp genome of diploid wheat. While a high-quality reference genome is now a pivotal research tool in all crop improvement efforts, many projects emphasise delivery timeframes at the expense of genome quality.
Our species of interest is sugarcane (Saccharum hybrid), which possesses a highly aneupolyploid genome 10 Gbp in size. In line with international efforts, our group has contributed a range of approaches to elucidate the sugarcane genome sequence. The first of these has been an international BAC-by-BAC sequencing effort to determine a "monoploid" genome sequence for the genotype R570, in which we have assembled Illumina paired-read data for 465 BACs into one or a few contigs each. Second, we have applied second-generation whole-genome shotgun sequencing at up to 45x coverage to assemble the R570 genome de novo. Our preliminary assembly represents over two thirds of the expected genome size, with a contig NG50 of 1200 bp. Finally, we are now pursuing a third-generation sequencing approach to supplement the short-read results and move towards a final hybrid assembly.
Without a robust approach, structural and functional annotation cannot inform meaningful biological interpretation. As our work approaches completion, it is becoming clear that a hybrid approach combining all of these outputs will ultimately be required to produce a high-quality reference genome for sugarcane. No single technology or approach solves this problem. With an "out-of-the-box" approach nowhere in sight, assembling high-quality genome sequences will likely remain an important problem for some time yet.
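The abstract quotes a contig NG50, while the slides report N50 values. As a reminder of how the two differ, here is a minimal sketch: N50 is the contig length at which the length-sorted contigs first cover half of the assembly itself, whereas NG50 uses half of the expected genome size as the threshold. The function name `nx50` and the toy contig lengths are hypothetical and are not taken from the sugarcane data.

```python
from typing import Iterable, Optional


def nx50(lengths: Iterable[int], target_size: Optional[int] = None) -> Optional[int]:
    """Return N50 when target_size is None, or NG50 when target_size
    is the expected genome size (in bp)."""
    ordered = sorted(lengths, reverse=True)
    # N50 measures against the assembly's own total length;
    # NG50 measures against the expected genome size instead.
    total = target_size if target_size is not None else sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    # The contigs never reach half of the target size, so NG50 is undefined.
    return None


# Hypothetical toy contig lengths, for illustration only.
toy_contigs = [100, 80, 60, 40, 20]
n50 = nx50(toy_contigs)                     # threshold: half of 300 bp
ng50 = nx50(toy_contigs, target_size=500)   # threshold: half of 500 bp
print(n50, ng50)
```

Because the preliminary assembly covers only part of the expected genome, NG50 is the stricter of the two figures: the same contig set has to reach a larger threshold, so NG50 can never exceed N50 when the assembly is smaller than the genome.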

Published in: Science


  1. Establishing a hybrid approach to the polyploid sugarcane genome assembly
     Paul Berkman
     CSIRO PLANT INDUSTRY
  2. R570 genome progress – CSIRO overview
     • Research context
     • BAC sequencing contribution
       ◦ Assembly strategy
       ◦ Progress and results
     • Whole Genome Shotgun approach
       ◦ Rationale
       ◦ Approach and progress to date
     • Future directions: there’s no silver bullet
       ◦ Options for WGS assembly improvement
       ◦ Hybrid approach for allele detection
  3. CSIRO Objectives
     • Generation of sugarcane genomic sequence
       ◦ This will be done in conjunction with the International Sugarcane Genome Consortium using next generation sequencing technology
     • Improvement of the genetic map of sugarcane (CSIRO)
       ◦ For use as a tool in assembling genome sequence
     • Development of a web-based platform to integrate the genomic sequence and genetic information (CSIRO)
       ◦ This tool will enable researchers to identify regions of the genome that are associated with a particular trait.
  4. International objectives
     • R570 was designated as the reference at the Consortium meeting in August 2009
       ◦ France, Australia, Brazil, USA, South Africa
       ◦ WGS sequencing thought to be unlikely approach
       ◦ R570 BAC library already in existence
       ◦ A comprehensive genome map is already available
       ◦ Some already commenced sequencing R570 BACs
     • January 2013
       ◦ Re-invigoration of the effort
       ◦ Keygene’s whole genome profiling technology applied
       ◦ Driving towards a monoploid genome sequence
  5. Monoploid genome sequence of sugarcane
     4/3/2014
     • Minimum tiling path of 5,029 BAC clones estimated to cover the monoploid genome sequence of sugarcane
     • To date around 2000 BAC clones have been sequenced
     • The rest will be sequenced in the following year.
  6. Complex Genome
     • Genome sizes shown: sugarcane cultivar 10000 Mbp; maize 5500 Mbp; rice 800 Mbp; sorghum 1600 Mbp; Wheat 34000 Mbp; Arabidopsis 300 Mbp
     • Figure labels: recombinants, S. spontaneum, S. officinarum
  7. Complex Genome
     • Genome sizes shown: sugarcane cultivar 10000 Mbp; Wheat 34000 Mbp; Human 6000 Mbp
     • Figure labels: recombinants, S. spontaneum, S. officinarum
  8. Our BAC strategy: High-throughput NGS sequencing
     • Illumina HiSeq2000
       ◦ > 600 Gbp/run
       ◦ > 35 Gbp/lane
       ◦ 96 barcodes/lane
       ◦ Better results than 454 raw assemblies
     • < $500 USD/BAC for extraction, sequencing, and assembly
       ◦ This could be reduced to < $200 USD using pooling
     Towards the sugarcane genome | Karen Aitken
  9. Our BAC strategy: Custom assembly pipeline
     • Developed in conjunction with
       ◦ Prof. Dave Edwards
       ◦ Mr Paul Visendi (PhD Student)
       ◦ Under development since July 2012
     • Raw data cleaning
       ◦ SOAP2 alignment to E. coli genome
         – Remove aligned reads from dataset
       ◦ Vector sequence (pBeloBAC11) removal
         – BLAST alignment to BAC sequence
         – Extract aligned reads
     • Pipeline diagram: Short-read dataset → SOAP2 against E. coli genome → Filtered short-read dataset → Vector sequence removal → Vector & E. coli filtered short-read dataset
  10. Our BAC strategy: Custom assembly pipeline
      • Assembly
        ◦ Optimal coverage ~500-fold
        ◦ SaSSY BAC-assembly algorithm
          – Developed at University of Queensland
          – Imelfort M.R., PhD Thesis, UQ 2012
          – Graph-based OLC assembler
          – More robust than Eulerian approach
          – Uses paired-read statistics during contig building
      • Scaffolding
        ◦ SSPACE public tool, uses Bowtie
        ◦ Collates alignment results and connects contigs
      • Pipeline diagram: Vector & E. coli filtered short-read dataset → SaSSY Assembler → SSPACE scaffolding → Assembled BAC Scaffolds (FASTA)
  11. Our BAC strategy: Custom assembly pipeline
      • Validation using BAC-end sequences
        ◦ BLAST alignment of BES
        ◦ Confirm alignment at ends of scaffold
        ◦ If 2 scaffolds, combine based on BES position
      • Validation using raw data
        ◦ Apply public tool HAGFISH
          – NZPFR, Ross Crowhurst et al.
        ◦ Uses distance between paired reads to:
          – Identify raw data gaps in assembly
      • Pipeline diagram: Assembled BAC Contigs (FASTA) → BES alignment and raw data insert size analysis → Validation results (CSV statistics & PNG image files)
  12. BAC results to date: Progress of Analysis
      • ~200 Gbp total sequence data generated for 453 BACs
      • Illumina HiSeq2000, 100bp paired-end reads
      • Data from BGI & AGRF
      • Validation component of pipeline under final development
      • Final assemblies of all 453 BACs to be completed April 2014
      • Roughly half randomly extracted
      • Other half targeted at QTL of interest
  13. BAC results to date: Assembly of first 166 BACs
      • 11 BACs did not contain an insert
      Total number of reads                 1,583,129,457
      Total volume sequence data (Mbp)      188,588.54
      Bacterial sequence contamination      11.87%
      Total number assembled contigs        976
      Total assembly Size (bp)              16,574,234
      Total assembly N50 (bp)               33,885
      Overall longest contig (bp)           128,410
      Number contigs > 10 kbp               459
      Number contigs > 100 kbp              10
  14. BAC results to date: Summary
      • Assemblies comparable to 454-based approaches
        ◦ ~500-fold coverage appears optimal for Illumina assemblies
        ◦ Combination of in-house and public tools required for best assembly
      • More BACs can be sequenced at lower cost
        ◦ BAC-pooling strategy in final development with UQ for future work
      • SUGESI project is now driving towards a full monoploid assembly
        ◦ No expressed intention to include genome-wide allelic information
  15. R570 Whole Genome Shotgun: Rationale
      • Why try?
        ◦ Lots of data available quite cheaply now
        ◦ Current BAC approach unlikely to provide much allelic information
        ◦ Realistically, allelic information is critical
      • Success of WGS approach in other species
        ◦ Diploid wheat A and D genomes (2013)
        ◦ Robust algorithms now handling complex datasets
          – Assemblathons
          – GAGE
      • Why not try?
  16. R570 Whole Genome Shotgun: Sequencing Approaches
      • Illumina HiSeq2000 (Illumina)
        ◦ 90-100 bp read length
        ◦ Paired reads
          – Fragment ends used to generate sequence
          – Fragment size (insert length) ranges from 180 bp to 32,000 bp
        ◦ Very low error rate
      • Pacific Biosciences (Pacbio)
        ◦ Long reads, 1,000-20,000 bp long
        ◦ Single unpaired reads
        ◦ Moderate error rate
  17. R570 Whole Genome Shotgun: Sequence Data
      LIBRARY              CURRENT SEQUENCE   CURRENT PHYSICAL   PROJECTED SEQUENCE
                           COVERAGE (x)       COVERAGE (x)       COVERAGE (x)
      Illumina 180bp       12.14              10.93              12.14
      Illumina 300bp       11.06              16.59              43.39
      Illumina 600bp       8.70               26.10              8.70
      Illumina 2,000bp     2.83               31.44              2.83
      Illumina 2,500bp     1.32               16.50              18.65
      Illumina 4,500bp     1.31               29.48              18.64
      Illumina 5,000bp     2.84               78.89              2.84
      Illumina 7,500bp     1.21               45.38              1.21
      Illumina 32,000bp    0.37               58.61              0.37
      Pacbio 20,000bp      0.00               0.00               2.7
      TOTAL                41.78              313.92             111.47
  18. R570 Whole Genome Shotgun: Progress of Analysis
      • Assembly
        ◦ Commenced April 2013
        ◦ 4 High-Performance Computing facilities have been tested
          – Ixthus (192Gb), Barrine (1Gb), Cherax (4Tb), and Zythos (6Tb)
        ◦ 3 Assembly algorithms have been compared
          – Velvet (public), AllPaths-LG (public), Biokanga (CSIRO-developed)
        ◦ ~20 test assemblies of 4x coverage dataset complete
          – Cumulative >600 days of compute time & max. 150Gb RAM used
        ◦ ~20 attempts at assembling 42x coverage dataset
          – Cumulative >6000 days of compute time & max. 1.5Tb RAM used
          – Preliminary assembly completed March 2014
      • Recent development of Biokanga has optimised it to succeed
  19. R570 Whole Genome Shotgun: Progress of Analysis
      • Preliminary assembly results
        ◦ Biokanga successfully assembled 42x data on Ixthus (150Gb RAM, 10 days)
      • Options to exploit this assembly:
        ◦ Assembly of sugarcane mitochondrial genome
        ◦ Extracting genomic context information for genes of interest
      Total number assembled scaffolds      5,344,024
      Total assembly Size (bp)              6,761,247,399
      Total assembly N50 (bp)               2,160
      Overall longest contig (bp)           67,936
      Number contigs > 1 kbp                2,223,020
      Number contigs > 10 kbp               7,326
  20. Future directions: How do we get a sugarcane genome?
      • Assembling sugarcane is challenging
      • Perhaps a little more than I had anticipated
      • THERE’S NO SILVER BULLET!
      • BAC approach won’t provide information on alleles, but is tried and tested
      • WGS can provide allele-scale resolution, but is complex and challenging
      • Hybrid approach is most likely the best approach
      • Using SUGESI monoploid assembly as a template
      • Adopting WGS approach to overlay allelic data
      • Identify gaps in monoploid assembly using WGS assembly
  21. Acknowledgements
      • Team members
        ◦ Karen Aitken
        ◦ Paul Berkman
        ◦ Jai Perroux
        ◦ Jingchuan Li
        ◦ Rosanne Casu
        ◦ Anne Rae
      • CSIRO and BSES Ltd sugarcane marker and breeding groups
      • Hélène Bergès and her team at INRA, Toulouse, France
      • DArT Pty Ltd Team
      • Australian Genome Research Facility (AGRF)
      • UQ colleagues
        ◦ Paul Visendi
        ◦ Dave Edwards
  22. Thank you
      CSIRO Plant Industry
      Paul Berkman
      OCE Postdoctoral Fellow
      t +61 7 3214 2361
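The sequence data table on slide 17 lists both sequence coverage and physical coverage for each paired-read library. The slides do not state how the two were related, so the following is only a sketch of the usual definition for paired reads: physical coverage counts the whole fragment spanned by a read pair, so it equals sequence coverage multiplied by insert size and divided by twice the read length. The function name and the per-library read length of 100 bp are assumptions; slide 16 gives only a 90-100 bp range.

```python
def physical_coverage(sequence_coverage: float, insert_bp: int, read_bp: int) -> float:
    """Physical (fragment-span) coverage of a paired-read library.

    Each fragment contributes 2 * read_bp sequenced bases but spans
    insert_bp bases of the genome, hence the insert / (2 * read) ratio.
    """
    if read_bp <= 0:
        raise ValueError("read_bp must be positive")
    return sequence_coverage * insert_bp / (2 * read_bp)


# (insert size in bp, current sequence coverage) copied from slide 17.
libraries = [(180, 12.14), (300, 11.06), (600, 8.70)]

READ_BP = 100  # assumption: not stated per library in the slides

results = {
    insert_bp: round(physical_coverage(seq_cov, insert_bp, READ_BP), 2)
    for insert_bp, seq_cov in libraries
}
for insert_bp, phys_cov in results.items():
    print(f"{insert_bp} bp library: physical coverage ~{phys_cov}x")
```

Under the 100 bp assumption these three libraries reproduce the physical coverage values in the table. The 2,000 bp library (2.83x sequence coverage) matches its tabulated 31.44x only with a 90 bp read length, which suggests, although the slides do not say so, that read length differed between libraries.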