How to Standardise and Assemble Raw Data into Sequences: What Does it Mean for a Laboratory to Use Such Technologies?

11th OIE Seminar at the XVII INTERNATIONAL SYMPOSIUM OF THE WORLD ASSOCIATION OF VETERINARY LABORATORY DIAGNOSTICIANS (WAVLD)
Saskatoon - 17th June 2015

Published in: Science

  1. How to Standardise and Assemble Raw Data into Sequences: What Does it Mean for a Laboratory to Use Such Technologies?
     Dr Joseph Hughes
     11th OIE Seminar, Saskatoon, 17th June 2015
  2. Decreasing sequencing cost
     [Figure: cost per raw megabase of DNA sequence, log scale from $0.01 to $10,000.00, x-axis from Jul-98 to Sep-17. Source: http://www.genome.gov/sequencingcosts]
     Democratization of sequencing: http://omicsmaps.com
  3. Applications of high-throughput sequencing
     •  Whole genome sequencing
     •  Genome variability within a host
     •  De-novo assembly of novel viruses
     •  Metagenomics of communities
  4. Considerations for a genome assembly pipeline
     •  Flexible pipeline: handling unknown genotypes or virus samples
     •  Platform independent: work with data from different platforms
     •  Virus independent: work on any virus
     •  Scalable to hundreds or thousands of samples
     •  Accuracy of SNP calling in the genome (outbreak analysis where samples are more closely related)
  5. [Flowchart: assembly pipeline with two branches, "Known reference" and "Unknown reference"]
     Stages: Pre-assembly processing; Assembly; Post-assembly processing
     Steps, in the order shown:
     •  Check format (sff, fastq)
     •  Convert to FASTQ
     •  Remove adaptor contaminants
     •  Remove host genome contamination
     •  Quality & length trimming
     •  Reference assembly
     •  De-novo assembly
     •  Contig merge
     •  Scaffolding contigs
     •  Validation
     •  Consensus
     •  Variant calling
     •  Classification
     •  Annotation
     •  Genome comparison
  6. Examples
     1.  1999-2001 in Northern Italy: emergence of highly pathogenic avian influenza H7N1
         •  Identify known molecular markers for viral pathogenicity in intra-host viral populations
         •  OIE & FAO reference lab for Influenza
     2.  2010 in the Netherlands: die-off of >1000 wild water frogs and newts
         •  Isolation, characterisation and relationship to known viruses of the Dutch frog killer
         •  Van Beurden et al. (2014). Genome Announc.
     [Image: hybrid Edible frog (Pelophylax kl. esculentus)]
  7. Example 1: Characterization of HPAI signature mutations
     Monne et al. (2014). Journal of Virology
  8. Pre-assembly processing
     trim_galore and FastQC for quality control
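To make the quality-trimming step concrete, here is a minimal Python sketch of 3'-end trimming using a running-sum rule of the kind used by BWA-style trimmers. It is an illustration only and not the code of trim_galore or FastQC; the function name `trim_3prime` and the default cutoff of 20 are assumptions chosen for the example.

```python
def trim_3prime(sequence, qualities, cutoff=20):
    """Trim low-quality bases from the 3' end of a read.

    qualities is a list of integer Phred scores, one per base.
    Walking from the 3' end, accumulate (quality - cutoff) and cut at
    the position where that running sum is lowest. Stop as soon as the
    sum becomes positive, i.e. when good bases outweigh the bad tail.
    The cutoff of 20 is an illustrative assumption.
    """
    running = 0
    lowest = 0
    cut = len(qualities)  # default: keep the whole read
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running > 0:
            break
        if running < lowest:
            lowest = running
            cut = i
    return sequence[:cut], qualities[:cut]


if __name__ == "__main__":
    seq, qual = trim_3prime("ACGTACGT", [30, 30, 30, 30, 30, 10, 5, 2])
    print(seq, qual)
```

With the toy read above, the three low-quality bases at the 3' end are removed and the first five bases are kept.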
  9. Reference assemblers?
     •  Hash-based tools: Mosaik, Novoalign, Stampy, Tanoti
     •  Burrows-Wheeler Transform-based tools: BWA, Bowtie2, NextGenMap
     Too many to choose from!
     http://www.bioinformatics.cvr.ac.uk/Tanoti
     [Figure: log10 depth of coverage (DOC) by position for eight segments (HA, M, NA, NP, NS, PA, PB1, PB2); legend: Bowtie2 and Stampy; Tanoti]
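The plots on this slide show log10 depth of coverage (DOC) by position. A minimal Python sketch of how such a profile could be computed from alignment intervals follows. The input format (1-based inclusive start and end pairs) and the function names are assumptions for illustration; real pipelines would read these from an alignment file.

```python
import math


def depth_of_coverage(reference_length, alignments):
    """Per-position depth for a reference of the given length.

    alignments is an iterable of (start, end) pairs, 1-based and
    inclusive (a hypothetical, simplified input format).
    Uses a difference array so each alignment costs O(1).
    """
    diff = [0] * (reference_length + 2)
    for start, end in alignments:
        diff[start] += 1
        diff[end + 1] -= 1
    depth = []
    running = 0
    for position in range(1, reference_length + 1):
        running += diff[position]
        depth.append(running)
    return depth


def log10_depth(depth):
    """log10 of each depth value; positions with zero depth map to None."""
    return [math.log10(d) if d > 0 else None for d in depth]


if __name__ == "__main__":
    profile = depth_of_coverage(10, [(1, 5), (3, 8), (3, 10)])
    print(profile)
    print(log10_depth(profile))
```

Zero-depth positions are returned as None rather than a number, since log10(0) is undefined; a plotting step can then show them as gaps.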
  10. Tablet - assembly
  11. Variant calling – detecting true mutations
     •  Many tools: LoFreq, V-Phaser, DiversiTools
     •  Using replicates to validate mutations (e.g. FMDV experiments)
     One LPAI sample collected after the identification of HPAI with an HA cleavage site and multiple HPAI-associated mutations at extremely low frequency
     [Figure: two heat maps of amino acid changes by samples, titled "Frequency of LPAI in HPAI samples" and "Frequency of HPAI in LPAI samples". Amino acid changes on both panels: PB2_I398T, PB1_D154G, PB1_G216S, PB1_E745K, PA_T61I, PA_K115N, PA_K252E, HA_A130T, HA_T146A, HA_E228A, HA_T454A, HA_R554K, NP_A349T, NP_N376S, NA_K173R, M1_A166V, NS1_I136V, NS1_N139D, NS1_-225R]
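The idea of using replicates to validate mutations can be sketched in a few lines of Python: keep only the variants that are called in both replicates above a minimum frequency. This is not the method of LoFreq, V-Phaser or DiversiTools; the function name, the dictionary input format, the 1% threshold and the toy variant labels and frequencies are all assumptions made up for illustration.

```python
def validated_variants(replicate_a, replicate_b, min_frequency=0.01):
    """Return variants seen in both replicates at or above min_frequency.

    Each replicate is a dict mapping a variant label to its observed
    frequency (hypothetical format). The 1% default is illustrative,
    not a recommended threshold.
    """
    shared = {}
    for variant, freq_a in replicate_a.items():
        freq_b = replicate_b.get(variant)
        if freq_b is None:
            continue  # called in one replicate only
        if freq_a >= min_frequency and freq_b >= min_frequency:
            shared[variant] = (freq_a, freq_b)
    return shared


if __name__ == "__main__":
    # Toy numbers invented for the example, not data from the study.
    rep_a = {"variant_1": 0.02, "variant_2": 0.005, "variant_3": 0.03}
    rep_b = {"variant_1": 0.015, "variant_2": 0.02}
    print(validated_variants(rep_a, rep_b))
```

A variant that appears in only one replicate, or below the threshold in either, is treated as a likely PCR or sequencing error and dropped.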
  12. Example 2: Isolation and Sequencing
     •  From dead wild water frog in September 2013
     •  Suspension from pooled internal organs
     •  Inoculated on BF-2 cells (bluegill fry fibroblast cells)
     •  DNA extracted using DNeasy kit (viral purity of 67%)
     •  DNA sheared by sonication
     •  KAPA library preparation
     •  MiSeq (Illumina) Machine #2 test run: total run 26,700,000 reads including 50% PhiX (16 Gb)
     •  13,127,123 paired-end 300 bp reads from the sample (7.9 Gb)
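The sample yield quoted on this slide can be checked with simple arithmetic. The Python sketch below assumes that "13,127,123 paired-end 300 bp reads" means 13,127,123 read pairs; under that reading the total is 7,876,273,800 bases, which agrees with the 7.9 Gb on the slide. The depth estimate is a rough upper bound that additionally assumes 67% of the bases are viral and a genome length of 107,772 bp (the final assembly size reported in this deck); the function names are hypothetical.

```python
def total_bases(read_pairs, read_length, paired=True):
    """Total sequenced bases for a run of fixed-length reads."""
    return read_pairs * read_length * (2 if paired else 1)


def mean_depth(bases, genome_length, target_fraction=1.0):
    """Rough mean depth if target_fraction of the bases map to the genome.

    Ignores duplicates, trimming and unmapped reads, so it is an
    upper-bound estimate only.
    """
    return bases * target_fraction / genome_length


if __name__ == "__main__":
    sample_bases = total_bases(13_127_123, 300)
    print(sample_bases)  # 7876273800
    # Assumption: 67% viral purity applies to sequenced bases.
    print(round(mean_depth(sample_bases, 107_772, 0.67)))
```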
  13. Assembly
     •  Abyss-pe de-novo assembler reconstructed the full genome in a single contig of 107,260 bp
     •  5 different regions had ambiguous/repetitive sequences
     •  Re-sequencing ambiguous regions with Sanger
     [Figure: contig map with coordinates 1, 1692, 1693, 21168, 21359, 38364, 38387, 66887, 67100, 73322, 73434, 107260 and five regions marked "?"]
  14. Finishing assembly
     •  CodonCode Aligner for assembling and checking the Sanger sequences
     •  SequencePatcher.pl to stitch the Sanger sequences into the de-novo contig
     •  iCORN2
     •  Final genome: 107,260 bp => 107,772 bp
  15. Annotating
     •  BLAST to find the most similar annotated genome
     •  Common Midwife Toad Virus (CMTV) from Spain
     •  Transfer of annotations from CMTV to the full genome (RATT)
     •  Identifies inappropriate start codons, frame-shifts
     •  Correcting transferred models using Artemis
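  16. [Figure: comparison of related genomes with accession numbers, 20 kb scale bar, and node support values 84, 95, 100, 100, 76, 100, 100, 100. Genomes shown: RGV JQ654586, STIV EU627010, FV3 KJ175144, FV3 AY548484, TFV AF389451, CGSIV KF512820, ADRV KF033124, ADRV KC865735, CMTV NL, CMTV JQ231222, ATV AY150217, EHNV FJ433873, ESV JQ724856]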
  17. Standard formats
     •  FASTQ – quality score depends on the technology and base caller
     •  SAM – soon v1.5 extensions
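Because FASTQ quality characters are encoded with an offset that has differed between technologies and base callers, decoding needs the offset to be known. The Python sketch below reads simple four-line FASTQ records and converts the quality string to Phred scores. It assumes unwrapped four-line records and an ASCII offset of 33 by default (older Illumina pipelines used 64); the function names are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records.

    Assumes each record occupies exactly four lines (no wrapped
    sequence or quality lines).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def decode_quality(quality, offset=33):
    """Convert a quality string to integer Phred scores."""
    return [ord(character) - offset for character in quality]


def error_probability(phred):
    """Base-call error probability for a Phred score: 10^(-Q/10)."""
    return 10 ** (-phred / 10)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for name, sequence, quality in read_fastq(example):
        print(name, sequence, decode_quality(quality))
```

Decoding the same string with the wrong offset silently shifts every score by 31, which is why checking the format is listed as an early pipeline step.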
  18. Genome standards – 5 categories
     Ladner et al. (2014) mBio
     •  % genome covered: >50%; ~80-90%; ~90-99%; 100%; 100%
     •  HTS coverage: ~15-30x; >100x; RACE; ~400-1000x
  19. Challenges: Rates of increase in data
     [Figure: log-scale plot, 1990 to 2012, of disk storage (Mbytes/$) and DNA sequencing (bp/$). Series: hard disk storage (MB/$), doubling time 14 months; pre-NGS (bp/$), doubling time 19 months; NGS (bp/$), doubling time 5 months. Source: http://genomebiology.com/2010/11/5/207]
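The doubling times on this slide translate into very different growth over the same period. As a worked example in Python, growth over t months with doubling time d is 2^(t/d); the five-year window below is an arbitrary choice for illustration, and the function name is hypothetical.

```python
def growth_factor(elapsed_months, doubling_time_months):
    """Fold increase after elapsed_months for a given doubling time."""
    return 2 ** (elapsed_months / doubling_time_months)


if __name__ == "__main__":
    months = 60  # illustrative five-year window
    for label, doubling in [
        ("Hard disk storage (MB/$)", 14),
        ("Pre-NGS sequencing (bp/$)", 19),
        ("NGS sequencing (bp/$)", 5),
    ]:
        print(f"{label}: x{growth_factor(months, doubling):,.1f}")
```

Over five years, a 5-month doubling time gives a 4096-fold increase in bases per dollar, against roughly 20-fold for storage per dollar at a 14-month doubling time, which is the gap the slide highlights.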
  20. Challenges: resources and technologies
     •  Shift towards more data, labs need to have dedicated bioinformaticians
     •  Rule of thumb: invest as much in computers and data scientists as in sequencing equipment and lab technicians
     •  Non-uniform coverage, repeat regions, systematic biases, PCR errors, sequencing errors, sequence length
  21. CVR bioinformatics team
     Director of OIE Collaborating Centre for Viral Genomics and Bioinformatics
     Director of Centre for Virus Research
