How to Standardise and Assemble Raw Data into Sequences: What Does it Mean for a Laboratory to Use Such Technologies?

Dr Joseph Hughes
11th OIE Seminar
Saskatoon - 17th June 2015

Decreasing sequencing cost
[Figure: cost per raw megabase of DNA sequence, log scale from $0.01 to $10,000, time axis from Jul-98 to Sep-17. Source: http://www.genome.gov/sequencingcosts]

Democratization of sequencing
[Figure source: http://omicsmaps.com]

Applications of high-throughput sequencing
•  Whole genome sequencing
•  Genome variability within a host
•  De-novo assembly of novel viruses
•  Metagenomics of communities

Considerations for a genome assembly pipeline
•  Flexible pipeline: handling unknown genotypes or virus samples
•  Platform independent: work with data from different platforms
•  Virus independent: work on any virus
•  Scalable to hundreds or thousands of samples
•  Accuracy of SNP calling in the genome (outbreak analysis where samples are more closely related)

Known reference vs. unknown reference

Pre-assembly processing
•  Check format (sff, fastq)
•  Convert to FASTQ
•  Remove adaptor contaminants
•  Remove host genome contamination
•  Quality & length trimming
Assembly
•  Known reference: reference assembly
•  Unknown reference: de-novo assembly, contig merge, scaffolding contigs
Post-assembly processing
•  Validation
•  Consensus
•  Variant calling
•  Classification
•  Annotation
•  Genome comparison
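
The "check format" step of the pre-assembly stage can be illustrated with a minimal sketch. It assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines); the function name `parse_fastq` and the two example reads are hypothetical and not part of the pipeline presented here.

```python
import io
from typing import Iterator, TextIO, Tuple


def parse_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream.

    Raises ValueError when a record breaks the basic FASTQ layout.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"separator does not start with '+': {plus!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header.strip()!r}")
        yield header[1:].strip(), seq, qual


# Two invented reads, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(parse_fastq(example))
print(records)
```

Files that fail such a check would normally be converted or re-exported before the trimming and decontamination steps.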

Examples
1.  1999-2001 in Northern Italy: emergence of highly pathogenic avian influenza H7N1
•  Identify known molecular markers for viral pathogenicity in intra-host viral populations
•  OIE & FAO reference lab for influenza
2.  2010 in the Netherlands: die-off of >1000 wild water frogs and newts
•  Isolation, characterisation and relationship to known viruses of the Dutch frog killer
•  Van Beurden et al. (2014). Genome Announc.
[Image: hybrid edible frog (Pelophylax kl. esculentus)]

Example 1: Characterization of HPAI signature mutations
Monne et al. (2014). Journal of Virology

Pre-assembly processing
trim_galore and FastQC for quality control
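
To make the idea of quality and length trimming concrete, here is a deliberately simplified sketch. It is not the algorithm used by trim_galore: it only removes low-quality bases from the 3' end and discards reads that end up too short. The function name `trim_read`, the Phred+33 offset and the default thresholds are illustrative assumptions.

```python
from typing import Optional, Tuple


def trim_read(
    seq: str,
    qual: str,
    min_quality: int = 20,
    min_length: int = 20,
    offset: int = 33,
) -> Optional[Tuple[str, str]]:
    """Trim low-quality bases from the 3' end; return None if the read becomes too short."""
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    if end < min_length:
        return None  # read discarded by the length filter
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 under the assumed Phred+33 offset.
kept = trim_read("ACGTACGT", "IIIII###", min_length=4)
dropped = trim_read("ACGT", "####", min_length=1)
print(kept, dropped)
```

Real trimmers also remove adaptor sequence and handle read pairs together, which this sketch leaves out.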

Reference assemblers?
•  Hash-based tools: Mosaik, Novoalign, Stampy, Tanoti
•  Burrows-Wheeler Transform-based tools: BWA, Bowtie2, NextGenMap
Too many to choose from
http://www.bioinformatics.cvr.ac.uk/Tanoti
[Figure: log10 depth of coverage (DOC) by position, one panel per segment (HA, M, NA, NP, NS, PA, PB1, PB2). Legend: Bowtie2 and Stampy; Tanoti]
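
The quantity plotted in the coverage figure, log10 depth of coverage per position, can be computed from alignment intervals as in the sketch below. The input format (1-based inclusive start and end positions), the function names and the three example alignments are assumptions made for illustration; in practice the intervals would come from the aligner's SAM/BAM output.

```python
import math
from typing import List, Sequence, Tuple


def depth_of_coverage(alignments: Sequence[Tuple[int, int]], length: int) -> List[int]:
    """Per-position read depth for a reference of the given length.

    Each alignment is a (start, end) pair, 1-based and inclusive.
    """
    diff = [0] * (length + 1)
    for start, end in alignments:
        diff[start - 1] += 1  # coverage begins at this position
        diff[end] -= 1        # and stops after the end position
    depth = []
    running = 0
    for i in range(length):
        running += diff[i]
        depth.append(running)
    return depth


def log10_depth(depth: Sequence[int]) -> List[float]:
    """log10 of depth; uncovered positions are reported as NaN."""
    return [math.log10(d) if d > 0 else float("nan") for d in depth]


# Three invented alignments on a 10-base reference.
depth = depth_of_coverage([(1, 4), (3, 6), (3, 8)], length=10)
logged = log10_depth(depth)
print(depth)
```

Plotting `logged` against position for each segment and each aligner gives the kind of comparison shown in the figure.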

Variant calling – detecting true mutations
•  Many tools: LoFreq, Vphaser, DiversiTools
•  Using replicates to validate mutations (e.g. FMDV experiments)

One LPAI sample, collected after the identification of HPAI, carried an HPAI HA cleavage site and multiple HPAI-associated mutations at extremely low frequency
[Figure: two panels, "Frequency of LPAI in HPAI samples" and "Frequency of HPAI in LPAI samples"; axes: amino acid changes and samples.
Amino acid changes: PB2_I398T, PB1_D154G, PB1_G216S, PB1_E745K, PA_T61I, PA_K115N, PA_K252E, HA_A130T, HA_T146A, HA_E228A, HA_T454A, HA_R554K, NP_A349T, NP_N376S, NA_K173R, M1_A166V, NS1_I136V, NS1_N139D, NS1_-225R.
Sample labels, first panel: X4756.99, X4827.99, X4828.99, X4911.99, X4708.99, X4618.99, X4618.99.1, X4749.99.
Sample labels, second panel: X4295.99, X3675.99, X4829.99, X1744.99, X2732.99, X3283.99]
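
The idea of using replicates to validate low-frequency mutations can be sketched as follows. This is not how LoFreq, Vphaser or DiversiTools work internally; it only shows one simple rule, keeping a variant when it passes a frequency and depth threshold in every replicate. The function name, the thresholds, the variant labels and all counts are invented for illustration.

```python
from typing import Dict, List, Tuple

# variant label -> (reads supporting the variant, total depth at that site)
Counts = Dict[str, Tuple[int, int]]


def validated_variants(
    replicates: List[Counts],
    min_freq: float = 0.01,
    min_depth: int = 100,
) -> Dict[str, List[float]]:
    """Return variants seen in every replicate at sufficient depth and frequency."""
    if not replicates:
        return {}
    shared = set(replicates[0])
    for rep in replicates[1:]:
        shared &= set(rep)  # keep only variants reported in all replicates
    result: Dict[str, List[float]] = {}
    for variant in sorted(shared):
        freqs = []
        for rep in replicates:
            alt, depth = rep[variant]
            if depth < min_depth:
                break  # not enough reads to trust this site
            freqs.append(alt / depth)
        else:
            if all(f >= min_freq for f in freqs):
                result[variant] = freqs
    return result


# Invented counts for two replicates of the same sample.
rep1 = {"var1": (50, 1000), "var2": (12, 1000), "var3": (30, 1000)}
rep2 = {"var1": (40, 800), "var2": (2, 1000)}
confirmed = validated_variants([rep1, rep2])
print(confirmed)
```

Here only `var1` survives: `var2` falls below the frequency threshold in the second replicate and `var3` is missing from it.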

Example 2: Isolation and Sequencing
•  From dead wild water frog in September 2013
•  Suspension from pooled internal organs
•  Inoculated on BF-2 cells (bluegill fry fibroblast cells)
•  DNA extracted using DNeasy kit (viral purity of 67%)
•  DNA sheared by sonication
•  KAPA library preparation
•  MiSeq (Illumina) Machine #2 test run: total run 26,700,000 reads including 50% PhiX (16 Gb)
•  13,127,123 paired-end 300 bp reads from the sample (7.9 Gb)
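
The yields quoted above can be checked with a short calculation. The sketch assumes that each read count refers to read pairs of 2 x 300 bp, which is the reading that reproduces the 16 Gb and 7.9 Gb figures; the function name is hypothetical. The last line gives an upper bound on mean depth if every base mapped to the 107,260 bp contig reported for this sample, which ignores host and PhiX reads.

```python
def paired_end_yield_bp(read_pairs: int, read_length_bp: int) -> int:
    """Bases produced by paired-end sequencing: two reads per pair."""
    return read_pairs * 2 * read_length_bp


run_bp = paired_end_yield_bp(26_700_000, 300)     # whole MiSeq run
sample_bp = paired_end_yield_bp(13_127_123, 300)  # reads from the sample

# Upper bound only: assumes all sample bases map to the 107,260 bp contig.
max_mean_depth = sample_bp / 107_260

print(f"run: {run_bp / 1e9:.1f} Gb")        # run: 16.0 Gb
print(f"sample: {sample_bp / 1e9:.1f} Gb")  # sample: 7.9 Gb
print(f"mean depth upper bound: {max_mean_depth:,.0f}x")
```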

Assembly
•  abyss-pe de-novo assembler reconstructed the full genome in a single contig of 107,260 bp
•  5 different regions had ambiguous/repetitive sequences
•  Re-sequencing ambiguous regions with Sanger
[Figure: contig map with five regions marked "?". Coordinates shown: 1, 1692, 1693, 21168, 21359, 38364, 38387, 66887, 67100, 73322, 73434, 107260]

Finishing assembly
•  CodonCode Aligner for assembling and checking the Sanger sequences
•  SequencePatcher.pl to stitch the Sanger sequences into the de-novo contig
•  iCORN2
•  Final genome: 107,260 bp => 107,772 bp
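
Stitching re-sequenced regions into a contig amounts to replacing coordinate ranges with new sequence. The sketch below shows that operation in general terms; it is not the implementation of SequencePatcher.pl. The function name `patch_contig`, the patch format (1-based inclusive start and end plus replacement sequence) and the toy sequences are assumptions for illustration.

```python
from typing import List, Tuple

# (start, end, replacement): 1-based inclusive range to replace in the contig
Patch = Tuple[int, int, str]


def patch_contig(contig: str, patches: List[Patch]) -> str:
    """Replace each coordinate range of the contig with its replacement sequence."""
    ordered = sorted(patches, key=lambda p: p[0])
    for start, end, _ in ordered:
        if not 1 <= start <= end <= len(contig):
            raise ValueError(f"patch {start}-{end} is outside the contig")
    for (_, prev_end, _), (next_start, _, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            raise ValueError("patches overlap")
    # Apply from right to left so earlier coordinates stay valid.
    for start, end, replacement in reversed(ordered):
        contig = contig[: start - 1] + replacement + contig[end:]
    return contig


# Toy contig with two ambiguous stretches (N) replaced by longer sequences.
draft = "AAAANNNNCCCCNNGGGG"
finished = patch_contig(draft, [(5, 8, "TTTTTT"), (13, 14, "ACG")])
print(len(draft), "=>", len(finished))
```

As in the real case, the patched sequence can be longer than the draft when the re-sequenced regions resolve collapsed repeats.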

Annotating
•  BLAST to find the most similar annotated genome
•  Common Midwife Toad Virus (CMTV) from Spain
•  Transfer of annotations from CMTV to the full genome (RATT)
•  Identifies inappropriate start codons, frame-shifts
•  Correction of transferred models using Artemis

[Figure: tree with genome comparison, scale bar 20 kb. Genomes shown: RGV JQ654586, STIV EU627010, FV3 KJ175144, FV3 AY548484, TFV AF389451, CGSIV KF512820, ADRV KF033124, ADRV KC865735, CMTV NL, CMTV JQ231222, ATV AY150217, EHNV FJ433873, ESV JQ724856. Node labels: 84, 95, 100, 100, 76, 100, 100, 100]

Standard formats
•  FASTQ – quality score depends on the technology and base caller
•  SAM – soon v1.5 extensions
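
Why the FASTQ quality score depends on technology and base caller becomes clearer when the encoding is decoded. A Phred score Q corresponds to an error probability p = 10^(-Q/10), and each score is stored as a single ASCII character with an offset. The sketch assumes the Phred+33 offset (Sanger and recent Illumina output); older Illumina pipelines used an offset of 64, so the same character can mean a different score. The function name is hypothetical.

```python
from typing import List


def phred_to_error_probabilities(quality: str, offset: int = 33) -> List[float]:
    """Convert a FASTQ quality string into per-base error probabilities."""
    probabilities = []
    for ch in quality:
        q = ord(ch) - offset                # Phred score encoded by this character
        probabilities.append(10 ** (-q / 10))
    return probabilities


# 'I' -> Q40, '5' -> Q20, '+' -> Q10 under the assumed Phred+33 offset.
probs = phred_to_error_probabilities("I5+")
print(probs)
```

Checking which offset a file uses is part of the format check that precedes trimming and assembly.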

Genome standards – 5 categories
Ladner et al. (2014) mBio
•  % genome covered, in category order: >50%; ~80-90%; ~90-99%; 100%; 100%
•  HTS coverage values shown: ~15-30 x; >100 x; RACE; ~400-1000 x

Challenges: Rates of increase in data
[Figure: hard disk storage (MB/$, doubling time 14 months) against DNA sequencing (bp/$) for pre-NGS and NGS, by year from 1990 to 2012, log scale. Source: http://genomebiology.com/2010/11/5/207]

Challenges: resources and technologies
•  Shift towards more data, labs need to have dedicated bioinformaticians
•  Rule of thumb: invest as much in computers and data scientists as in sequencing equipment and lab technicians
•  Non-uniform coverage, repeat regions, systematic biases, PCR errors, sequencing errors, sequence length

CVR bioinformatics team
Director of OIE Collaborating Centre for Viral Genomics and Bioinformatics
Director of Centre for Virus Research
