How to Standardise and Assemble Raw Data into Sequences: What Does it Mean for a Laboratory to Use Such Technologies?
Dr Joseph Hughes
11th OIE Seminar
Saskatoon - 17th June 2015
Decreasing sequencing cost
[Figure: cost per raw megabase of DNA sequence, log scale from $0.01 to $10,000.00, Jul-98 to Sep-17. Source: http://www.genome.gov/sequencingcosts]
Democratization of sequencing
http://omicsmaps.com
Applications of high-throughput sequencing
•  Whole genome sequencing
•  Genome variability within a host
•  De-novo assembly of novel viruses
•  Metagenomics of communities
Considerations for a genome assembly pipeline
•  Flexible pipeline: handling unknown genotypes or virus samples
•  Platform independent: work with data from different platforms
•  Virus independent: work on any virus
•  Scalable to hundreds or thousands of samples
•  Accuracy of SNP calling in the genome (outbreak analysis where samples are more closely related)
Pre-assembly processing
•  Check format (sff, fastq)
•  Convert to FASTQ
•  Remove adaptor contaminants
•  Remove host genome contamination
•  Quality & length trimming
Assembly
•  Known reference: reference assembly
•  Unknown reference: de-novo assembly, contig merge, scaffolding contigs
•  Validation
•  Consensus
•  Variant calling
•  Classification
Post-assembly processing
•  Annotation
•  Genome comparison
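The first pre-assembly step, checking the input format, can be sketched in a few lines. This is a minimal illustration rather than the pipeline used in the talk: the function names are hypothetical, and the check relies only on the SFF magic number (the bytes ".sff"), the gzip magic number, and the "@" that opens a FASTQ record.

```python
SFF_MAGIC = b".sff"       # SFF files start with the magic number 0x2E736666
GZIP_MAGIC = b"\x1f\x8b"  # compressed input needs decompressing before sniffing


def classify_header(head):
    """Guess the read format from the first bytes of a file."""
    if head.startswith(SFF_MAGIC):
        return "sff"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(b"@"):
        return "fastq"
    return "unknown"


def sniff_read_format(path):
    """Open a file in binary mode and classify its first four bytes."""
    with open(path, "rb") as handle:
        return classify_header(handle.read(4))
```

Files reported as "sff" would then go through the "Convert to FASTQ" step; a leading "@" is only a hint, so a full parser should still validate the records.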
Examples
1.  1999-2001 in Northern Italy: emergence of highly pathogenic avian influenza H7N1
•  Identify known molecular markers for viral pathogenicity in intra-host viral populations
•  OIE & FAO reference lab for Influenza
2.  2010 in the Netherlands: die-off of >1000 wild water frogs and newts
•  Isolation, characterisation and relationship to known viruses of the Dutch frog killer
•  Van Beurden et al. (2014). Genome Announc.
[Image: hybrid edible frog (Pelophylax kl. esculentus)]
Example 1: Characterization of HPAI signature mutations
Monne et al. (2014). Journal of Virology
Pre-assembly processing
trim_galore and FastQC for quality control
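To make "quality & length trimming" concrete, here is a simplified sketch. It is not the algorithm trim_galore implements; it only removes low-quality bases from the 3' end and discards reads that end up too short. The function name, the thresholds, and the Phred+33 encoding are assumptions chosen for illustration.

```python
def trim_read(seq, qual, min_quality=20, min_length=50, offset=33):
    """Trim low-quality bases from the 3' end of one read.

    Returns (sequence, quality) after trimming, or None when the
    trimmed read is shorter than min_length.
    """
    scores = [ord(ch) - offset for ch in qual]
    end = len(scores)
    # Walk back from the 3' end while the base quality is below threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], qual[:end]


# Illustrative read: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(trim_read("ACGTAC", "IIII##", min_quality=20, min_length=3))
```

Dedicated trimmers also remove adaptor sequence and use running-sum quality criteria, so this is best read as an explanation of the idea rather than a replacement for them.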
Reference assemblers?
•  Hash-based tools: Mosaik, Novoalign, Stampy, Tanoti
•  Burrows-Wheeler Transform-based tools: BWA, Bowtie2, NextGenMap
Too many to choose from!
http://www.bioinformatics.cvr.ac.uk/Tanoti
[Figure: log10 depth of coverage (DOC) by position for each influenza segment (HA, M, NA, NP, NS, PA, PB1, PB2). Legend: Bowtie2 and Stampy; Tanoti]
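The log10(DOC) axis in the per-segment plots is the log-transformed depth of coverage at each reference position. A minimal sketch of that calculation follows, assuming alignments are already available as 1-based inclusive (start, end) pairs that lie within the reference; the function names are hypothetical, and a real pipeline would read these intervals from a SAM/BAM file.

```python
import math


def depth_of_coverage(alignments, reference_length):
    """Per-position depth from (start, end) alignments, 1-based inclusive."""
    # Difference array: +1 where a read starts, -1 just after it ends.
    diff = [0] * (reference_length + 2)
    for start, end in alignments:
        diff[start] += 1
        diff[end + 1] -= 1
    depth = []
    running = 0
    for position in range(1, reference_length + 1):
        running += diff[position]
        depth.append(running)
    return depth


def log10_depth(depth):
    """log10 of each depth; uncovered positions are returned as None."""
    return [math.log10(d) if d > 0 else None for d in depth]


# Two illustrative alignments on a 6-base reference.
print(depth_of_coverage([(1, 3), (2, 5)], 6))
```

Comparing these profiles between aligners, as the figure does, shows where one tool fails to place reads that another maps.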
Tablet - assembly
Variant calling: detecting true mutations
•  Many tools: LoFreq, V-Phaser, DiversiTools
•  Using replicates to validate mutations (e.g. FMDV experiments)
One LPAI sample, collected after the identification of HPAI, carried an HA cleavage site and multiple HPAI-associated mutations at extremely low frequency
[Figure: two heat maps of amino acid changes by sample. Panels: frequency of LPAI in HPAI samples; frequency of HPAI in LPAI samples.
Amino acid changes (both panels): PB2_I398T, PB1_D154G, PB1_G216S, PB1_E745K, PA_T61I, PA_K115N, PA_K252E, HA_A130T, HA_T146A, HA_E228A, HA_T454A, HA_R554K, NP_A349T, NP_N376S, NA_K173R, M1_A166V, NS1_I136V, NS1_N139D, NS1_-225R.
Samples: X4756.99, X4827.99, X4828.99, X4911.99, X4708.99, X4618.99, X4618.99.1, X4749.99; and X4295.99, X3675.99, X4829.99, X1744.99, X2732.99, X3283.99]
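Using replicates to validate low-frequency mutations can be illustrated as follows. This is a sketch under assumptions, not the procedure of any of the tools named above: each replicate is taken to be a dictionary of variant label to observed frequency, and the function name and the 1% threshold are hypothetical.

```python
def validated_variants(replicate_a, replicate_b, min_frequency=0.01):
    """Keep variants seen at or above min_frequency in both replicates."""
    shared = {}
    for variant, freq_a in replicate_a.items():
        freq_b = replicate_b.get(variant)
        if freq_b is None:
            continue  # absent from the second replicate: not validated
        if freq_a >= min_frequency and freq_b >= min_frequency:
            shared[variant] = (freq_a, freq_b)
    return shared


# Frequencies below are invented for illustration only.
run_1 = {"variant_1": 0.05, "variant_2": 0.002, "variant_3": 0.03}
run_2 = {"variant_1": 0.04, "variant_2": 0.02}
print(validated_variants(run_1, run_2))
```

A variant that appears in only one replicate, or falls below the threshold in either, is treated as a likely PCR or sequencing artefact.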
Example 2: Isolation and Sequencing
•  From dead wild water frog in September 2013
•  Suspension from pooled internal organs
•  Inoculated on BF-2 cells (bluegill fry fibroblast cells)
•  DNA extracted using DNeasy kit (viral purity of 67%)
•  DNA sheared by sonication
•  KAPA library preparation
•  MiSeq (Illumina) Machine #2 test run: total run 26,700,000 reads including 50% PhiX (16 Gb)
•  13,127,123 paired-end 300 bp reads from the sample (7.9 Gb)
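The yields quoted above follow from simple arithmetic if each counted read is taken to be a read pair (an assumption, but one that reproduces both figures). The sketch below also gives a rough nominal depth, assuming the 67% viral purity carries over to the fraction of viral reads and using the 107,260 bp contig length reported on the Assembly slide; both are simplifications.

```python
def paired_end_yield_bases(read_pairs, read_length):
    """Total bases from paired-end sequencing: two reads per pair."""
    return read_pairs * 2 * read_length


READ_LENGTH = 300
sample_bases = paired_end_yield_bases(13_127_123, READ_LENGTH)
run_bases = paired_end_yield_bases(26_700_000, READ_LENGTH)

print(round(sample_bases / 1e9, 1))  # 7.9 (Gb from the sample)
print(round(run_bases / 1e9, 1))     # 16.0 (Gb for the whole run)

# Rough nominal depth; purity-to-read-fraction is an assumption.
VIRAL_FRACTION = 0.67
GENOME_LENGTH = 107_260
nominal_depth = sample_bases * VIRAL_FRACTION / GENOME_LENGTH
print(f"nominal depth: about {nominal_depth:,.0f}x")
```

Even with generous allowances for host reads and trimming losses, the depth is far beyond what a genome of this size needs, which is typical when a small genome occupies half a MiSeq run.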
Assembly
•  Abyss-pe de-novo assembler reconstructed the full genome in a single contig of 107,260 bp
•  5 different regions had ambiguous/repetitive sequences
•  Re-sequencing ambiguous regions with Sanger
[Figure: map of the contig with coordinates 1, 1692, 1693, 21168, 21359, 38364, 38387, 66887, 67100, 73322, 73434, 107260, and five regions marked "?"]
Finishing assembly
•  CodonCode Aligner for assembling and checking the Sanger sequences
•  SequencePatcher.pl to stitch the Sanger sequences into the de-novo contig
•  iCORN2
•  Final genome: 107,260 bp => 107,772 bp
Annotating
•  BLAST to find the most similar annotated genome
•  Common Midwife Toad Virus (CMTV) from Spain
•  Transfer of annotations from CMTV to the full genome (RATT)
•  Identifies inappropriate start codons, frame-shifts
•  Correction of transferred models using Artemis
[Figure: genome comparison with a 20 kb scale bar and node support values (76 to 100). Genomes: RGV JQ654586, STIV EU627010, FV3 KJ175144, FV3 AY548484, TFV AF389451, CGSIV KF512820, ADRV KF033124, ADRV KC865735, CMTV NL, CMTV JQ231222, ATV AY150217, EHNV FJ433873, ESV JQ724856]
Standard formats
•  FASTQ: quality score depends on the technology and base caller
•  SAM: v1.5 extensions coming soon
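Because FASTQ quality encoding depends on the platform and base caller, the same character can mean different scores. The sketch below decodes quality strings under the two common ASCII offsets (33 and 64) and converts a Phred score Q to an error probability with p = 10^(-Q/10). The offset guess is a heuristic labelled as such, and the function names are hypothetical.

```python
def decode_quality(quality, offset=33):
    """Convert a FASTQ quality string into integer quality scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Probability that a base call is wrong for Phred score q."""
    return 10 ** (-q / 10)


def guess_offset(quality_strings):
    """Heuristic: characters below ';' (ASCII 59) only occur with offset 33.

    High-quality offset-33 data can contain no such characters, so a
    result of 64 should be confirmed against the sequencer metadata.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


print(decode_quality("I#"))       # offset 33
print(decode_quality("h", 64))    # offset 64
print(error_probability(30))
```

Checking the encoding before trimming matters: reading offset-64 data as offset 33 inflates every score by 31 and lets poor bases through.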
Genome standards: 5 categories
Ladner et al. (2014) mBio
% genome covered, by category: >50%; ~80-90%; ~90-99%; 100%; 100%
HTS coverage, in increasing order of category: ~15-30 x; >100 x; RACE; ~400-1000 x
Challenges: Rates of increase in data
[Figure: hard disk storage (MB/$) and DNA sequencing (bp/$) by year, 1990 to 2012, log scales.
Hard disk storage (MB/$): doubling time 14 months.
Pre-NGS sequencing (bp/$): doubling time 19 months.
NGS sequencing (bp/$): doubling time 5 months.
Source: http://genomebiology.com/2010/11/5/207]
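The doubling times quoted for this figure (14 months for disk storage, 19 months for pre-NGS sequencing, 5 months for NGS) translate into very different growth over the same period, since growth after t months with doubling time T is 2^(t/T). The short sketch below works this out for an arbitrary five-year window; the window length is an illustrative choice.

```python
def fold_increase(months, doubling_time_months):
    """Growth factor after `months` given a constant doubling time."""
    return 2 ** (months / doubling_time_months)


DOUBLING_TIMES = [
    ("Hard disk storage (MB/$)", 14),
    ("Pre-NGS sequencing (bp/$)", 19),
    ("NGS sequencing (bp/$)", 5),
]

for label, doubling in DOUBLING_TIMES:
    growth = fold_increase(60, doubling)
    print(f"{label}: about {growth:,.0f}-fold over 5 years")
```

With a 5-month doubling time, five years is twelve doublings (2^12 = 4096-fold), against roughly four doublings for disk storage, which is why data volume outpaces the capacity to store it.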
Challenges: resources and technologies
•  Shift towards more data: labs need dedicated bioinformaticians
•  Rule of thumb: invest as much in computers and data scientists as in sequencing equipment and lab technicians
•  Non-uniform coverage, repeat regions, systematic biases, PCR errors, sequencing errors, sequence length
CVR bioinformatics team
Director of OIE Collaborating Centre for Viral Genomics and Bioinformatics
Director of Centre for Virus Research
