IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data

http://iongap.hpc.iter.es

Computer Engineer Degree Final Project.
Universidad de La Laguna, Spain, July 2014.

Ion Torrent technology enables genome sequencing at reduced cost; however, its major drawback is the lack of tools dedicated to processing and assembling Ion Torrent reads.
IonGAP is a free, graphical, integrated pipeline designed for the assembly and subsequent analysis of Ion Torrent sequencing data. Both its components and their configuration are based on a research process aimed at finding the optimal combination of tools for obtaining good results from single-end reads generated by the Ion Torrent PGM sequencer, mainly from bacterial genomic material.

Transcript

  1. IonGAP. Author: Adrián Báez Ortega. Supervisors: Marcos Colebrook Santamaría, José Luis Roda García. Date: 17/07/2014.
  2. Contents: 1. Introduction 2. Objective of the project 3. State of the art 4. The genome assembler 5. A genome assembly and analysis pipeline 6. IonGAP Web service 7. Parallel assembly of large genomes 8. Conclusions
  3. Introduction. DNA, Genomics, Genome, Proteins, Genes, Double helix, Biomedicine, Life
  4. Introduction. Genome sequencing; Genome de novo assembly (Adapted from: http://en.wikipedia.org/wiki/Genomic_library#mediaviewer/File:Whole_genome_shotgun_sequencing_versus_Hierarchical_shotgun_sequencing.png)
  5. Introduction. Genomics: Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias. Computer Science: Escuela Técnica Superior de Ingeniería Informática. Bioinformatics.
  6. Objective of the project. The development of an easy-to-use integrated software platform that offers optimally configured processing and de novo assembly of genomic data obtained by Ion Torrent sequencing, complemented by several result analysis stages.
  7. State of the art. Genome sequencing. Most sequencing technologies: Paired-end short reads (two reads of 25-250 bp separated by a gap). IUETSPC’s sequencing technology: Single-end long reads (200-400 bp). Diagram labels: Genome sequencing, Genome fragments, FASTQ file.
  8. State of the art. Genome assembly (Source: http://gcat.davidson.edu/phast/img/contig.png)
  9. State of the art. Genome assembly • Genome assembler – Overlap-layout-consensus (OLC) assemblers – De Bruijn graph (DBG) assemblers (Adapted from: http://gcat.davidson.edu/phast)
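To make the DBG approach named on this slide more concrete, here is a minimal Python sketch of how a de Bruijn graph can be built from reads by splitting them into k-mers. The function name `build_de_bruijn`, the toy reads, and the value k = 4 are assumptions chosen for illustration; none of them comes from the slides or from any of the assemblers discussed.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical toy reads that overlap on "GTAC".
reads = ["ACGTAC", "GTACGG"]
graph = build_de_bruijn(reads, 4)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

An OLC assembler would instead compare whole reads against each other to find overlaps, which is one reason the two families behave differently on long reads.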
  12. State of the art (Source: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646)
  13. State of the art. Data preprocessing • Removing adapters • Quality control
  15. State of the art. Genome finishing • Scaffolding • Correction of assembly errors – Discrepancies with reads or reference genome – Repeat correction
  18. The genome assembler. Data preprocessing, Genome assembly, Genome finishing, Genome analysis
  19. The genome assembler. Data set: Streptococcus agalactiae (686,800 reads) (Source: http://ngm.nationalgeographic.com/wallpaper/img/2013/01/08-streptococcus_1600.jpg)
  20. The genome assembler. Comparative study of assemblers • OLC assemblers – MIRA – Celera Assembler – SGA • DBG assemblers – ABySS – Ray – Velvet – SparseAssembler – Minia
  21. The genome assembler. Results • Number of contigs ≥ 500 bp • N50 length (50% of the genome is in contigs of length N50 or larger; source: http://schatzlab.cshl.edu/teaching/2012/CSHL.Sequencing/Whole%20Genome%20Assembly%20and%20Alignment.pdf). Conclusions • MIRA is the most suitable assembler • DBG is not suitable for long-read assembly
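The two metrics used in this comparison can be computed directly from a list of contig lengths. The Python sketch below is illustrative only: the function names `n50` and `count_large` and the example lengths are assumptions, and it uses the assembly size (sum of contig lengths) as the reference total rather than a known genome size.

```python
def n50(lengths):
    """Return the largest length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no contigs at all


def count_large(lengths, min_len=500):
    """Count contigs at least min_len bp long (500 bp is the cutoff named on the slide)."""
    return sum(1 for length in lengths if length >= min_len)


# Hypothetical contig lengths in bp.
contigs = [8000, 4000, 3000, 2000, 600, 400]
print(n50(contigs))          # 4000
print(count_large(contigs))  # 5
```

Fewer contigs above the cutoff and a higher N50 both indicate a less fragmented assembly.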
  26. The genome assembler. MIRA assembler. Diagram labels: Automatic editing, Data preprocessing, Fast read comparison, Smith-Waterman alignment, Contig assembly, Finished project
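For readers unfamiliar with the Smith-Waterman step named in the diagram, the following Python sketch shows the textbook local-alignment recurrence, computing only the best score. It is not MIRA's implementation; the function name and the scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # 8: the shared "ACGT" scores 4 x 2
```

In an OLC workflow, a score like this is what decides whether two reads flagged by the fast comparison step truly overlap.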
  27. The genome assembler. Assembly parameter optimization • Number of assembly iterations • Uniform read distribution • Separation of long repeats in different contigs • Maximum number of times a contig can be rebuilt during an iteration • Minimum number of reads per contig • Minimum size of a contig for being considered as "large" • Minimum read length • Minimum repeat length • Minimum overlap length • Minimum overlap score. Conclusion: The assembler's default settings are already its optimal configuration
  28. A genome assembly and analysis pipeline. Data preprocessing, Genome assembly, Genome finishing, Genome analysis
  29. A genome assembly and analysis pipeline. Example outputs.
      [Figure: excerpt of a nucleotide sequence]
      Annotation record: gene: cas2; inference: ab initio prediction:Prodigal:2.60; inference: similar to AA sequence:UniProtKB:G3ECR3; locus_tag: Sagalactiae_00003; product: CRISPR-associated endoribonuclease Cas2; protein_id: gnl|Prokka|Sagalactiae_00003
      Contig name | Subject name | Score | % Identity
      Sagalactiae_c8 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00
      Sagalactiae_c8 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00
      Sagalactiae_c10 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00
      Sagalactiae_c10 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00
  30. A genome assembly and analysis pipeline. Genome assembly, Data preprocessing, Genome finishing, Genome analysis
  31. A genome assembly and analysis pipeline. Data preprocessing • Comparative study of trimmers (PRINSEQ, ERNE-filter, Trimmomatic) – Removing adapters → 5’ trimming – Discarding useless reads → Minimum length – Removing low-quality regions • Internal quality control of MIRA – Sliding window trimming. Parameter labels: Maximum length; Sliding window trimming: Window length, Quality threshold
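The sliding window trimming mentioned on this slide, with its window length and quality threshold parameters, can be sketched in Python as follows. This is a simplified illustration and does not reproduce the exact behaviour of PRINSEQ, ERNE-filter, Trimmomatic, or MIRA; the function name, default values, and example read are assumptions.

```python
def sliding_window_trim(seq, quals, window=4, threshold=20):
    """Cut the read at the start of the first window whose mean quality is below threshold."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have the same length")
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < threshold:
            return seq[:start], quals[:start]
    # No low-quality window found (or read shorter than the window): keep the read whole.
    return seq, quals


# Hypothetical read whose quality drops towards the 3' end.
seq = "ACGTACGTAC"
quals = [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq)  # ACGTA

# A minimum-length filter (assumed value) would then discard reads that became too short.
min_length = 3
keep_read = len(trimmed_seq) >= min_length
```

A stricter threshold or a shorter window removes more bases per read.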
  32. A genome assembly and analysis pipeline. Data preprocessing (Figure: Mauve Assembly Metrics)
  33. A genome assembly and analysis pipeline. Data preprocessing. Conclusion: Read preprocessing has negative effects on the assembly • An extensive evaluation of read trimming effects on Illumina NGS data analysis (Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. PLoS ONE 2013): "For high quality values, trimmed datasets produce slightly more fragmented assemblies, probably due to a more stringent trimming that reflects also on lower computational needs." • MIRA user manual (Chevreux B): "For heavens' sake: do NOT try to clip or trim by quality yourself. Do NOT try to remove standard sequencing adaptors yourself. Just leave the data alone!"
  34. A genome assembly and analysis pipeline. Data preprocessing, Genome finishing, Genome assembly, Genome analysis
  35. A genome assembly and analysis pipeline. Genome finishing • Scaffolding – Impossible: no mate-pair reads • Correction of assembly errors – Simplifier: selective elimination of redundant sequences
  36. A genome assembly and analysis pipeline. Genome finishing: Simplifier • Only eliminates complete redundant contigs • Time expensive • Natural repeats in genome → Risky. Conclusion: It is better to leave postprocessing in the user's hands
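The idea of eliminating only complete redundant contigs can be illustrated with a short Python sketch that drops any contig found in full, on either strand, inside a longer contig that has already been kept. This is not Simplifier's actual algorithm, which the slides do not describe; the function names and toy contigs are assumptions, and exact substring matching stands in for whatever similarity test the real tool applies.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def drop_contained_contigs(contigs):
    """Keep only contigs that are not fully contained in an already kept, longer contig."""
    ordered = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    kept = {}
    for name, seq in ordered:
        rc = reverse_complement(seq)
        # Every candidate is checked against every kept contig, so the cost grows quadratically.
        if any(seq in other or rc in other for other in kept.values()):
            continue
        kept[name] = seq
    return kept


# Hypothetical contigs: c2 lies inside c1, c3 lies inside c1 on the opposite strand.
contigs = {"c1": "ACGTACGTTTGA", "c2": "CGTACG", "c3": "TCAAAC", "c4": "GGGCCC"}
kept = drop_contained_contigs(contigs)
print(sorted(kept))  # ['c1', 'c4']
```

The sketch also shows why the slide calls the approach risky: a genuine repeat that occurs both on its own and inside a larger contig would be discarded as if it were redundant.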
  37. A genome assembly and analysis pipeline. Data preprocessing, Genome analysis, Genome assembly, Genome finishing
  38. A genome assembly and analysis pipeline. Genome analysis • Quality analysis of reads and contigs (FastQC) • Taxonomic classification (BLAST) • Genome annotation (Prokka). If reference sequence provided: • Genome alignment and coverage analysis (MUMmer, Circos, BLAST, Circoletto, Mauve, genoPlotR) • Contig reordering (Mauve)
  39. A genome assembly and analysis pipeline. Genome analysis • Taxonomic classification (BLAST) • Genome annotation (Prokka)
  40. A genome assembly and analysis pipeline. Genome analysis • Genome annotation (Prokka) (Figure: UGENE genome viewer)
  41. A genome assembly and analysis pipeline. Genome analysis. If reference sequence provided: • Genome alignment and coverage analysis (MUMmer, Circos, BLAST, Circoletto, Mauve, genoPlotR)
  42. A genome assembly and analysis pipeline (Figure: generated by Circos, BLAST and Circoletto)
  43. A genome assembly and analysis pipeline. Genome analysis. If reference sequence provided: • Contig reordering (Mauve) (Figure: Mauve genome viewer)
  45. IonGAP Web service. Functioning and implementation • Web user interface • Input Web form • Two independent modules (daemons) – Assembly module – Analysis module • User notification via email
  46. IonGAP Web service. Functioning and implementation • Hosting: ETSII’s Computing Center – Virtual machine (Ubuntu 12.04) – Dual-core 64-bit processor – 17 GB RAM
  47. IonGAP Web service
  49. IonGAP Web service. Web service demo: IonGAP | an integrated Genome Assembly Platform for Ion Torrent data (http://193.145.101.223/)
  50. IonGAP Web service. Genome assembly with IonGAP: Trypanosoma cruzi • Extremely repetitive genome • Data explosion • Data filtering: 900 MB = 1,500,000 reads
  51. Parallel assembly of large genomes. Parallel genome assembly • Parallel computing: Computer cluster • Contrail – Parallel assembly on Hadoop • ETSII’s Computing Center – Cluster of 108 computers – Hadoop installation
  52. Parallel assembly of large genomes. Parallel assembly with Contrail
  53. Parallel assembly of large genomes. Parallel assembly with Contrail. Conclusions • Good performance – Parallel computing is the future of assembly • Bad results – Contrail uses DBG → Not suitable for long reads
  54. Conclusions • IonGAP meets IUETSPC's need for an automated tool for the assembly and preliminary analysis of Ion Torrent data • Making it available to the scientific community is intended to stimulate low-cost genome research and the development of other customized solutions • The S. agalactiae genome has been successfully assembled, and a manuscript is being prepared for publication in a scientific journal
  55. Conclusions. Future work • New options and features • Cloud assembly with Amazon Web Services • Parallel OLC assembly on Hadoop • High performance computing – ITER’s Teide HPC – September 2014
  56. Conclusions. Multidisciplinary work is the way to tackle the new science of the 21st century. Genomics: Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias. Computer Science: Escuela Técnica Superior de Ingeniería Informática. Bioinformatics.
  57. Many thanks for your attention
