Author
Adrián Báez Ortega
Supervisors
Marcos Colebrook Santamaría
José Luis Roda García
Date
17/07/2014
IonGAP
Contents
1. Introduction
2. Objective of the project
3. State of the art
4. The genome assembler
5. A genome assembly and analysis pipeline
6. IonGAP Web service
7. Parallel assembly of large genomes
8. Conclusions
[Figure: word cloud of key terms: DNA, Genomics, Genome, Proteins, Genes, Double helix, Biomedicine, Life]
Introduction
Genome sequencing
Genome de novo assembly
Adapted from:
http://en.wikipedia.org/wiki/Genomic_library#mediaviewer/File:Whole_genome_shotgun_sequencing_versus_Hierarchical_shotgun_sequencing.png
Genomics: Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias (IUETSPC)
Computer Science: Escuela Técnica Superior de Ingeniería Informática (ETSII)
Bioinformatics
Objective of the project
To develop an easy-to-use, integrated software platform that offers
optimally configured processing and de novo assembly of genomic data
obtained by Ion Torrent sequencing, complemented by several result
analysis stages.
Genome sequencing
• Most sequencing technologies: paired-end short reads (two reads of 25-250 bp separated by a gap)
• IUETSPC's sequencing technology: single-end long reads (200-400 bp)
[Figure: genome fragments → FASTQ file]
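As a rough illustration of what the FASTQ file of single-end reads contains, the sketch below parses four-line FASTQ records and decodes their quality strings. The function names are invented for this example, and the Phred+33 offset is an assumption about the encoding; multi-line FASTQ variants are not handled.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base Phred scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    # Hypothetical two-read file, used only to exercise the parser.
    sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n!!\n")
    for read_id, seq, qual in read_fastq(sample):
        print(read_id, len(seq), phred_scores(qual))
```

Read lengths and mean qualities computed this way are the raw inputs to the quality control and assembly stages discussed later.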
State of the art
Source:
http://gcat.davidson.edu/phast/img/contig.png
Genome assembly
• Genome assembler
– Overlap-layout-consensus (OLC) assemblers
– De Bruijn graph (DBG) assemblers
Adapted from:
http://gcat.davidson.edu/phast
Source:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874646
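The two assembler families can be contrasted with a toy sketch: an OLC assembler starts from pairwise read overlaps, whereas a DBG assembler breaks reads into k-mers and links them in a graph. Both functions below are simplified illustrations with invented names; they use exact string matching only, while real assemblers tolerate sequencing errors and handle reverse complements.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """OLC view: length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """DBG view: map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


if __name__ == "__main__":
    print(suffix_prefix_overlap("ACGTAC", "TACGG"))
    print(de_bruijn_graph(["ACGT"], k=3))
```

Because the k-mer decomposition discards the long-range information inside each read, the overlap-based view keeps more of what long reads offer, which is consistent with the conclusions drawn later in the comparative study.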
Data preprocessing
• Removing adapters
• Quality control
Genome finishing
• Scaffolding
• Correction of assembly errors
– Discrepancies with reads or reference genome
– Repeat correction
The genome assembler
Pipeline stages: Data preprocessing → Genome assembly → Genome finishing → Genome analysis
Data set: Streptococcus agalactiae (686,800 reads)
Source:
http://ngm.nationalgeographic.com/wallpaper/img/2013/01/08-streptococcus_1600.jpg
Comparative study of assemblers
• OLC assemblers
– MIRA
– Celera Assembler
– SGA
• DBG assemblers
– ABySS
– Ray
– Velvet
– SparseAssembler
– Minia
Results
• Number of contigs ≥ 500 bp
• N50 length
Conclusions
• MIRA is the most suitable assembler
• DBG is not well suited to long-read assembly
N50: 50% of the assembly is contained in contigs of length N50 or longer
Source:
http://schatzlab.cshl.edu/teaching/2012/CSHL.Sequencing/Whole%20Genome%20Assembly%20and%20Alignment.pdf
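The two metrics used in the comparison can be computed directly from the list of contig lengths. The sketch below follows the usual definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size); the function names and the example lengths are made up for illustration.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly, or 0 for an empty assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def count_large_contigs(contig_lengths: Iterable[int], min_length: int = 500) -> int:
    """Count contigs of at least `min_length` bp (500 bp is the cut-off used in the slides)."""
    return sum(1 for length in contig_lengths if length >= min_length)


if __name__ == "__main__":
    example = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths
    print(n50(example))
    print(count_large_contigs([100, 500, 1200]))
```

A good assembly tends to combine a high N50 with a low number of large contigs, which is why both figures are reported together.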
MIRA assembler
[Figure: MIRA workflow stages: data preprocessing, fast read comparison, Smith-Waterman alignment, contig assembly, automatic editing, finished project]
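The Smith-Waterman step named in the workflow is a local alignment between two candidate reads. The sketch below computes only the best local alignment score with a linear gap penalty; the scoring values are arbitrary choices for illustration, and this is not MIRA's actual implementation.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between `a` and `b` (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    # Two hypothetical reads sharing the segment "ACGT".
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The full dynamic programming table costs time proportional to the product of the two read lengths, which is one reason a fast read comparison is run first to select candidate pairs.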
Assembly parameter optimization
• Number of assembly iterations
• Uniform read distribution
• Separation of long repeats into different contigs
• Maximum number of times a contig can be rebuilt during an iteration
• Minimum number of reads per contig
• Minimum size of a contig to be considered "large"
• Minimum read length
• Minimum repeat length
• Minimum overlap length
• Minimum overlap score
Conclusion
By default, the assembler is already set to its optimal configuration
A genome assembly and analysis pipeline
[Figure: excerpt of an assembled contig nucleotide sequence, followed by an example annotation and taxonomic classification]
gene cas2
inference ab initio prediction:Prodigal:2.60
inference similar to AA sequence:UniProtKB:G3ECR3
locus_tag Sagalactiae_00003
product CRISPR-associated endoribonuclease Cas2
protein_id gnl|Prokka|Sagalactiae_00003
Contig name | Subject name | Score | % Identity
Sagalactiae_c8 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00
Sagalactiae_c8 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00
Sagalactiae_c10 | Streptococcus agalactiae 2603V/R strain 2603V/R 16S ribosomal RNA, complete sequence | 2846 | 100.00
Sagalactiae_c10 | Streptococcus agalactiae ATCC 13813 strain JCM 5671 16S ribosomal RNA, complete sequence | 2772 | 100.00
Data preprocessing
• Comparative study of trimmers (PRINSEQ, ERNE-filter, Trimmomatic)
– Removing adapters → 5’ trimming
– Discarding useless reads → Minimum length
– Removing low-quality regions → Maximum length, sliding window trimming
• Internal quality control of MIRA
– Sliding window trimming → Window length, quality threshold
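Sliding window trimming, with its two parameters (window length and quality threshold), can be sketched as follows. This is a generic version of the idea, not the exact algorithm of MIRA or of any of the trimmers compared; the function name and default values are assumptions made for the example.

```python
from typing import Sequence


def sliding_window_trim(qualities: Sequence[int], window: int = 4,
                        threshold: float = 20.0) -> int:
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean quality
    drops below `threshold`; reads shorter than the window are evaluated
    as a single window.
    """
    n = len(qualities)
    if n == 0:
        return 0
    window = min(window, n)
    for start in range(n - window + 1):
        mean_quality = sum(qualities[start:start + window]) / window
        if mean_quality < threshold:
            return start
    return n


if __name__ == "__main__":
    # Hypothetical read whose quality collapses towards the 3' end.
    quals = [30, 30, 30, 30, 30, 10, 10, 10]
    keep = sliding_window_trim(quals)
    print(keep, quals[:keep])
```

A longer window smooths over isolated low-quality bases, while a higher threshold trims more aggressively and yields shorter reads.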
[Figure: Mauve Assembly Metrics]
Conclusion
Read preprocessing has negative effects on the assembly
• An extensive evaluation of read trimming effects on Illumina NGS data analysis
(Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. PLoS ONE 2013):
"For high quality values, trimmed datasets produce slightly more fragmented
assemblies, probably due to a more stringent trimming that reflects also on
lower computational needs."
• MIRA user manual (Chevreux B):
"For heavens' sake: do NOT try to clip or trim by quality yourself. Do NOT try to
remove standard sequencing adaptors yourself. Just leave the data alone!"
Genome finishing
• Scaffolding
– Impossible: no mate-pair reads
• Correction of assembly errors
– Simplifier: selective elimination of redundant
sequences
Simplifier
• Only eliminates completely redundant contigs
• Time-consuming
• Natural repeats in the genome → Risky
Conclusion
It is better to leave postprocessing in the user's hands
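The idea of eliminating completely redundant contigs can be illustrated with a naive containment check. The sketch below is not the Simplifier tool's algorithm; it is an assumed, exact-match version with invented names that drops any contig found whole, in either orientation, inside a longer contig already kept.

```python
from typing import Dict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def drop_contained_contigs(contigs: Dict[str, str]) -> Dict[str, str]:
    """Keep only contigs that are not exact substrings of a longer kept contig."""
    ordered = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    kept: Dict[str, str] = {}
    for name, sequence in ordered:
        query = sequence.upper()
        query_rc = reverse_complement(query)
        # Quadratic in the number of contigs: every candidate is checked
        # against every contig kept so far.
        if any(query in other or query_rc in other for other in kept.values()):
            continue
        kept[name] = query
    return {name: contigs[name] for name in kept}


if __name__ == "__main__":
    example = {"c1": "ACGTACGTTT", "c2": "GTACG", "c3": "AAACG", "c4": "GGGG"}
    print(sorted(drop_contained_contigs(example)))
```

Even this simple version shows both drawbacks listed above: the all-against-all comparison is slow on large assemblies, and a contig that represents a genuine repeat in the genome would be discarded just like a redundant one.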
Genome analysis
• Quality analysis of reads and contigs (FastQC)
• Taxonomic classification (BLAST)
• Genome annotation (Prokka)
If reference sequence provided:
• Genome alignment and coverage analysis
(MUMmer, Circos, BLAST, Circoletto, Mauve, genoPlotR)
• Contig reordering (Mauve)
[Figure: genome annotation (Prokka) shown in the UGENE genome viewer]
[Figure: genome alignment and coverage plot generated by Circos, BLAST and Circoletto]
[Figure: contig reordering shown in the Mauve genome viewer]
IonGAP Web service
Functioning and implementation
• Web user interface
• Input Web form
• Two independent modules (daemons)
– Assembly module
– Analysis module
• User notification via email
• Hosting: ETSII’s Computing Center
– Virtual machine (Ubuntu 12.04)
– Dual-core 64-bit processor
– 17 GB RAM
Web service demo
IonGAP | an integrated Genome Assembly Platform for Ion Torrent data (http://193.145.101.223/)
Genome assembly with IonGAP
Trypanosoma cruzi
• Extremely repetitive genome
• Data explosion
• Data filtering: 900 MB = 1,500,000 reads
Parallel assembly of large genomes
Parallel genome assembly
• Parallel computing: Computer cluster
• Contrail
– Parallel assembly on Hadoop
• ETSII’s Computing Center
– Cluster of 108 computers
– Hadoop installation
Parallel assembly with Contrail
Conclusions
• Good performance
– Parallel computing is the future of assembly
• Bad results
– Contrail uses DBG → Not suitable for long reads
Conclusions
• IonGAP meets IUETSPC's need for an automated tool for
the assembly and preliminary analysis of Ion Torrent data
• Making it available to the scientific community is
intended to stimulate low-cost genome research and the
development of other customized solutions
• The S. agalactiae genome has been successfully
assembled, and a manuscript is being prepared for
publication in a scientific journal
Future work
• New options and features
• Cloud assembly with Amazon Web Services
• Parallel OLC assembly on Hadoop
• High performance computing
– ITER’s Teide HPC – September 2014
Conclusions
Multidisciplinary work is the way to tackle the new
science of the 21st century
Many thanks for your attention

IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data