How Next Generation Sequencing
(NGS) Data Are Analyzed: From Raw
Data to Sequence Contigs
Madhu Anand, DrPH
NYSDOH Bureau of Communicable
Disease Control
E-mail: madhu.anand@health.ny.gov
Objectives
• Understand the process of obtaining a contiguous
sequence
• Describe the technology needed for conducting NGS
• Understand differences between conducting NGS for
viral vs bacterial vs eukaryotic organisms
• Explain the difference between reference and de-novo
based assemblies
• Understand the quantitative measures used to assess
the quality of sequencing data.
• Understand the differences between SNP and MLST analysis
Overall NGS Process
• Specimen
• DNA is broken up into fragments
• Sequence reads are generated from the fragments
• Sequence reads are assembled into
contiguous sequences (contigs)
• Contigs are put together in order
• The resulting sequences are then compared
Why It is called NGS
• Sanger Sequencing (First generation)
• Most widely used sequencing technology for
approximately 25 years
• Next Generation Sequencing
• “AKA” high-throughput sequencing
• Includes most sequencing technologies that came after
Sanger sequencing
• Technologies
• Second Generation (Short-read)
• Illumina
• Ion Torrent
• 454 pyrosequencing (Legacy Technology)
• Third Generation (Long-read)
• PacBio
• Oxford Nanopore
Sanger vs NGS
• Both allow for whole genome sequencing
• NGS allows for massively parallel sequencing of
target genes
• Both technology standards have their utility in
genomics today
• Cost
• Sequence read length
• Massively Parallel
• Millions of fragments are sequenced in a single run
vs one forward and reverse read
Platform Characteristics
Platform | Read length (bp) | Isolates per run (max) | Run time | Instrument cost | Cost/Mb
Illumina HiSeq 2500 | 150 | 600-1000 | 5-11 d | $740K | $0.05
Illumina MiSeq | 150, 250, 300 | 12-16 | 26 h, 36 h, 65 h | $99K | $1.37
Illumina NextSeq | 75, 150 | 96 | 29 h | $250K | $0.03-$0.07
IonTorrent PGM (314, 316, 318) | 200, 400 | 1-10 | 2-8 h | $75K | $0.93-$7.5
Ion Proton | 100-200 | 96 | 2-4 h | $245K | $0.02
PacBio RSII | 10,000-40,000 | 8/SMRT cell | 0.5-2 h | $750K | $180.00
Sanger | 650 | 96 | 1 h | $100K | $2800
Romer-Carleton
Platforms and Pathogens
• Read length
• Size of genome being sequenced
Viral Genome
• Compact
• ~10,000 nucleotides (nt) (typically, ~3,000–200,000)
• Little wasted space
• Variable composition
• DNA; RNA
• Single-stranded; double stranded
• Linear; circular
• Single; segmented
• Often highly variable
• Particularly true of ssRNA viruses
• Quasispecies. Example: hepatitis C virus
AMD – Innovate * Transform * Protect
Example: HBV Genome
~3200 nt
Bacterial Genome
• Larger
• ~5 million nt (typically ~2–10 million)
• Pan-genome
• Core genome: ~3,000–5,000 genes, present in most
strains of a given species
• Accessory genome: up to thousands of genes, not
always present
• Structure
• dsDNA
• Usually single, circular chromosome
Bacterial Genome Complications
• Complications
• Plasmids
• Circular, dsDNA structures that replicate independently
from chromosome
• Often carry resistance or virulence genes
• Can be passed from one bacterium to another
• Phages
• Viruses that infect bacteria
• Genome can integrate into chromosome
• Other repetitive elements
Eukaryotes
• Huge
• Billions of nucleotides long
• Human genome: ~3B nt
• Structure
• dsDNA
• Multiple chromosomes
• Most of the human genome has no apparent
function
• Non-coding sequence, including introns (intragenic
regions) and intergenic DNA (~98–99%)
• Exons: (expressed region) protein-coding segments
(~1–2%)
Viral vs. Bacterial vs.
Eukaryotic Genome Summary
• Viral genomes: compact, efficient organization.
Size: 1,000s of nt.
• Bacterial genomes: more complex, repetitive
elements. Size: 1,000,000s of nt.
• Eukaryotic genomes: mostly non-coding sequence,
1,000,000,000s of nt.
Relative Genome Size
[Figure: relative genome size (haploid bp) on a log scale from 1 kb to 100 Gb for viruses, plasmids, archaea, bacteria, fungi, protists, plants, algae, nematodes, insects, mollusks, fish, amphibians, reptiles, birds, and mammals; a 3 Gb marker is shown.]
Mobile Genetic Elements
• Sequences that can move around within and
between genomes.
• The “mobilome” includes:
• Plasmids
• Transposons
• Bacteriophages
• In prokaryotes, there is horizontal gene transfer
between organisms.
• Mobile genetic elements enable horizontal gene
transfer, and consequently are an important
factor in acquired virulence, antimicrobial
resistance and bacterial evolution.
https://en.wikipedia.org/wiki/Mobile_genetic_elements
Methods Mol Biol. 2009;532:13-27. doi: 10.1007/978-1-60327-853-9_2
Mobile Genetic Components
• Allows two organisms to become different very
quickly
• Resistance
• Might have millions more nucleotides
• Interpretation of DNA becomes difficult
• Need to take these components into consideration
• Do you keep them in the analysis?
Plasmids
• Small (1 kb–200 kb) extrachromosomal
sequences that are circular and self-replicating,
with varying copy number.
• Present in both bacterial and eukaryotic
cells.
• Five functional categories of plasmids:
• Fertility, resistance, Col-plasmid,
degradative, virulence.
• Compatibility: similar plasmids usually
can’t coexist.
• Plasmids are an important mechanism of
horizontal gene transfer (HGT) and play a critical
role in bacterial diversity.
• HGT spreads antimicrobial resistance (AR), virulence,
metabolism, and other acquired features.
• Used as a cloning vector for research
• Can integrate/recombine with the host
genome.
Bourgogne 2003 doi: 10.1128/IAI.71.5.2736-2743.2003
Methods Mol Biol. 2009;532:13-27. doi: 10.1007/978-1-60327-853-9_2
Bacteriophage
• Bacteriophage are viruses that infect bacteria
• Extremely diverse group of viruses
• 5kb to 500kb ss/dsDNA or ss/dsRNA genome
• Inject payload into the target bacteria, hijack its
replicative machinery to reproduce.
• Lysogenic phages:
• Lysogenic phages do not cause immediate lysis of
the cell, but instead incorporate into the host
genome.
• Viral payload may remain dormant as a prophage.
This prophage may alter the phenotype of the cell
in important ways (e.g., p0157)
• Excision of the phage is also an imperfect process
and may include adjacent sequences from the
host genome that can be “transmitted” by the
phage to subsequent hosts.
Transposons
• A transposon, or transposable genetic element, can
mediate its own movement within the genome or
between hosts.
• Common feature of most genomes, not just bacteria.
• Encodes a transposase, which cuts the donor and
target sequences, and performs strand exchange.
• Targeting can be fairly indiscriminate – may knock
out gene function. This property is actually used as a
research tool.
Example: Tn10, Salmonella typhi
• 9,147 bp, encodes TetR
• Non-replicative cut and paste
• 5’ and 3’ ends include IS and a pair of putative transposons
Sequencing Process Definitions
• Sequence – generic name describing order of
biological letters (DNA/RNA or amino acids).
• Both reads and contigs are DNA/RNA or amino
acid sequences
• Reads – the short sequences of bases produced by
the sequencer; the pieces you are trying to assemble
• Contigs – reads that have been assembled
together; the final product
Why do we need to assemble?
• Bacterial genomes typically range in size from about
3 million to more than 5 million nucleotides (A, C, T,
and G).
• Genomes are broken into pieces of about 250
nucleotides to sequence
• New technologies can read >10,000 nucleotides at
once
• A large percentage of the sequence encodes
proteins (these regions are called genes)
• Those pieces then need to be aligned so they
can be compared to each other
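To give a sense of scale for the numbers above, here is a rough sketch of how many reads a run needs. The function name and the 30x target coverage are illustrative assumptions, not values from the slides.

```python
import math

def reads_needed(genome_size_nt, read_length_nt, target_coverage):
    """Reads required so that each base is covered target_coverage times on average."""
    return math.ceil(genome_size_nt * target_coverage / read_length_nt)

# Assumed example: 5-million-nt genome, 250-nt reads, 30x average coverage.
print(reads_needed(5_000_000, 250, 30))  # 600000
```

Even a small bacterial genome therefore yields hundreds of thousands of pieces, which is why assembly is done by software.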
Making an Assembly
• Two different methods
• Reference guided – when you have a “map” of the final
product and try to fit your pieces together to match
it
• Jigsaw puzzle – you have the picture and put the pieces
together
• Lyrics of a song – you know the words to the song and put
the words in the correct order
• De novo – put the pieces together by what “makes
sense”
• Jigsaw puzzle – you may not have the picture, but can
put the pieces together by what fits
• Lyrics of a song – you may not know the words of the song,
but can figure out the sentences from the words.
Example: De Novo Assembly
Example: De Novo assembly of lyrics, using “reads” of 4 to 6 words
(each word is a base pair/amino acid):
Reads:
1: yeah there will be an
2: tomorrow let it be o will
3: let it be and though they
4: me speaking words of wisdom
5: let it be let
6: the night is cloudy there
7: be let it be let it
8: be whisper words of wisdom
9: on me shine until tomorrow
10: let it be let it be let
11: be and when the broken hearted
12: answer let it be and though
13: it be let it be
14: and though the night is
15: the broken hearted people living
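A minimal sketch of the idea, using four of the reads above: repeatedly merge the two pieces whose end and start overlap the most. This greedy approach and the names overlap and greedy_assemble are illustrative assumptions; real assemblers such as SPAdes use more sophisticated algorithms.

```python
def overlap(a, b, min_len):
    """Longest suffix of word list a that equals a prefix of word list b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of min_len words remains."""
    contigs = [r.split() for r in reads]
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return [" ".join(c) for c in contigs]

reads = [
    "be and when the broken hearted",    # read 11
    "the broken hearted people living",  # read 15
    "and though the night is",           # read 14
    "the night is cloudy there",         # read 6
]
for contig in greedy_assemble(reads):
    print(contig)
```

With these four reads the sketch produces two contigs, because nothing links "people living" to "and though".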
Assembly “Pile-Up”
… standing right in front of me…
… let it be and in my hour of darkness is standing right in front of me…
…wisdom let it be and in she is standing right in front …
…wisdom let it be darkness she is of me…
…wisdom let and in my hour of darkness she is standing in front of me…
…wisdom let it be and in my hour of darkness she is standing right in front of me…
… wisdom let it be let
… wads of wisdom let it be let it be let
… speaking words of wisdom let
… speaking words of
… speaking words of wisdom let it be
trouble Mother Mary comes to me
in times of trouble Mother Mary comes words of…
myself if times of trouble Mother me speaking warts of…
when I find myself in times Mary comes to me speaking words of…
when I find myself in times of trouble Mother Mary comes to me speaking words of…
aligned reads
assembled contig
coverage = 3
coverage = 4
(note errors in some reads)
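The pile-up can be sketched in code: stack the aligned reads, count how many reads cover each position (coverage), and take the most common word at each position as the consensus. The start offsets and the function name pile_up are assumptions made for illustration.

```python
from collections import Counter

def pile_up(aligned_reads):
    """aligned_reads: list of (start_position, read_text). Returns (consensus, coverage)."""
    columns = {}
    for start, text in aligned_reads:
        for offset, word in enumerate(text.split()):
            columns.setdefault(start + offset, Counter())[word] += 1
    positions = sorted(columns)
    consensus = [columns[p].most_common(1)[0][0] for p in positions]
    coverage = [sum(columns[p].values()) for p in positions]
    return " ".join(consensus), coverage

aligned = [
    (0, "speaking words of wisdom let"),
    (0, "speaking words of"),
    (0, "speaking words of wisdom let it be"),
    (1, "wads of wisdom let it be"),  # read with an error ("wads")
    (3, "wisdom let it be"),
]
consensus, coverage = pile_up(aligned)
print(consensus)
print(coverage)
```

The erroneous "wads" is outvoted by three correct reads at that position, which is why higher coverage gives a more reliable contig.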
The Problem of Repeats
let it be let it
it be let it be let words …
let it be let it be it be speaking words …
… of wisdom let it be it be let it be
… of wisdom let it be let it be let it be speaking words …
let it be let it be let it be
it be let it be let be let it be let it words …
let it be let it be it be let it be it be speaking words …
… of wisdom let it be it be let it be it be let it be
… of wisdom let it be let it be let it be let it be let it be speaking words …
let it be let it it be speaking words …
it be let it be let be let it be
let it be let it be it be let it
… of wisdom let it be it be let it be words …
… of wisdom let it be let it be let it be let it be speaking words …
?
?
?
Final Assembly
when I find myself in times of trouble Mother Mary comes to me speaking
words of wisdom let it be and in my hour of darkness she is standing right
in front of me speaking words of wisdom
and though the night is cloudy there is still a light that shines on me
shine until tomorrow let it be o will I make up to the sound of music
Mother Mary comes to me speaking words of wisdom
whisper words of wisdom let it be and when the broken hearted people living
in the world agree there will be an answer let it be and though they may be
parted there is still a chance that they will see there will be an answer
3 contigs:
Many unused reads: be let it be
let it be let it
it be let it be
let it be let it be let
be let it be let
be let it be
it be let it be let
Reference-Guided (Mapped)
Assembly
[Figure: reads mapped to a reference sequence/genome, producing Contig 1 and Contig 2; the coverage track ranges from 18X to 1X, with a region of low sequence coverage.]
UNMAPPED READS
1. Sequences not present in the
reference.
2. Plasmids or other
extrachromosomal elements.
3. DNA structural
variation/rearrangement.
ADVANTAGES: Relatively fast, well-
suited to highly-conserved genomes.
DISADVANTAGES: Issues with high
diversity, mobile elements
Example software: BWA (https://github.com/lh3/bwa)
breseq (https://github.com/barricklab/breseq)
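A toy sketch of reference-guided mapping in the same lyric analogy: each read is placed where it matches the reference exactly, coverage is counted per position, and reads with no match are reported as unmapped. Exact matching and the name map_reads are simplifying assumptions; real mappers such as BWA use an index and tolerate mismatches.

```python
def map_reads(reference, reads):
    """Place each read at its first exact match in the reference."""
    ref = reference.split()
    coverage = [0] * len(ref)
    mapped, unmapped = {}, []
    for read in reads:
        words = read.split()
        hit = next((i for i in range(len(ref) - len(words) + 1)
                    if ref[i:i + len(words)] == words), None)
        if hit is None:
            unmapped.append(read)  # e.g. sequence not present in the reference
            continue
        mapped[read] = hit
        for i in range(hit, hit + len(words)):
            coverage[i] += 1
    return mapped, unmapped, coverage

reference = "when I find myself in times of trouble Mother Mary comes to me"
reads_to_map = [
    "find myself in times",
    "times of trouble Mother",
    "Mother Mary comes to me",
    "the broken hearted people",  # not in this reference
]
mapped, unmapped, coverage = map_reads(reference, reads_to_map)
print(mapped)
print(unmapped)
print(coverage)
```

The first two reference words get zero coverage and the last read stays unmapped, mirroring the low-coverage and unmapped-read cases listed above.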
De Novo Assembly
[Figure: reads assembled de novo into Contig 1 through Contig 7; one element is labeled PLASMID.]
ADVANTAGES: Reference agnostic:
assembles all the reads it can. Various
algorithms.
DISADVANTAGES: Doesn’t always get
things right, particularly with complex
repeats.
Example software: SPAdes (http://bioinf.spbau.ru/spades)
List:
https://en.wikipedia.org/wiki/Sequence_assembly#Available_assemblers
De Novo Assembly: Contigs
Example: Campylobacter: 29 contigs; 1.61 Mb; N50: 151 kb
Improving and Assessing
Assembly Quality
• Assessing assemblies
• N50 – Sort the contigs from longest to shortest and add up their lengths until the running
total reaches at least 50% of the assembly; N50 is the length of the last (shortest) contig added.
Still used, but it should not be the only metric. Other metrics: L50, NG50, N90.
• Contigs (largest, smallest, mean, median, size distribution)
• Coverage Plots – low coverage areas, high coverage areas (collapsed repeats)
• Quast – collection of assembly assessment tools with detailed reports (misassembly
detection, structural variation)
http://bioinformatics.oxfordjournals.org/content/29/8/1072.full.pdf
• GAGE (Genome Assembly Gold-standard Evaluations) – assembly evaluation comparable in aim to the Assemblathon
http://genome.cshlp.org/content/early/2012/01/12/gr.131383.111.full.pdf
• Improving assembly
• Higher Coverage
• Paired end instead of unpaired reads
• Longer read length
• Read quality
• Assemblers
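The N50 definition above can be worked through in a few lines. The contig lengths are made-up illustrative values, and the function name n50_l50 is an assumption; L50 is the number of contigs needed to reach the 50% mark.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:  # reached at least 50% of the assembly
            return length, count
    raise ValueError("contig_lengths must not be empty")

# Illustrative lengths (kb); total = 300, so the 50% mark is 150.
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # (70, 2)
```

Here the two largest contigs (80 + 70) reach half the assembly, so N50 is 70 and L50 is 2.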
Longer vs Shorter Reads
• Read length depends on instruments used
• Depending on goal, one may be better than
other
• Shorter reads
• Pro – economic
• Con – repeats, missing areas
• Long reads
• Pro - provide better coverage and higher
consensus
• Con - expensive
Benefit of Longer Reads
front of me speaking words of wisdom let
it be let it be let it be let it be let
it be whisper words of wisdom let it be
and when the broken
33-word read:*
font of me sporking wit of wisdom let it
see let it be leg if be let is but let
it be whimper words of doom set it be
and then the broken
Poor quality, 33-word read:*
* analogous to long-read sequencing, such as PacBio or nanopore sequencing
NGS Quality Control Metrics
• Q-score or Phred Quality Score - A Phred
quality score is a measure of the quality of the
identification of the nucleobases generated by
automated DNA sequencing
• Coverage- The average number of reads
representing a given nucleotide in the
reconstructed sequence.
• Insert Size - is the length of the DNA (or RNA)
that you want to sequence and that is
"inserted" between the adapters (so adapters
excluded).
The Alphabet Soup of Analysis –
Q-score
Quality scores – likelihood that the base call is correct
Phred – quality score stored in the FASTQ file generated by the sequencer, one score per base call
Q30 – a base call with a 1 in 1,000 chance of being incorrect; usually reported as the percentage
of base calls at Q30 or above (Q20 – 1 in 100 chance of being incorrect)
The score indicates whether a base call is trustworthy and can be used in an hqSNP
analysis
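The relationship between a Phred score Q and the error probability P is Q = -10 log10(P). The sketch below converts scores to probabilities and decodes a FASTQ quality string; it assumes the common Phred+33 ASCII encoding, and the quality string and function names are made up for illustration.

```python
def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def fraction_at_least(scores, threshold=30):
    """Fraction of base calls at or above the threshold (e.g. Q30)."""
    return sum(s >= threshold for s in scores) / len(scores)

scores = decode_quality("II5?#")   # illustrative quality string
print(scores)                       # [40, 40, 20, 30, 2]
print(error_probability(30))        # 1 in 1,000
print(fraction_at_least(scores))    # 0.6
```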
Coverage
Inter-Quartile Range (IQR): The IQR is the difference in
sequencing coverage between the 75th and 25th percentiles of the
histogram. A high IQR indicates high variation in coverage across
the genome, while a low IQR reflects more uniform sequence
coverage.
*Adapted from Illumina.com
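As a rough sketch of the IQR calculation, the example below compares an evenly covered genome with an unevenly covered one. The depth values are invented, and the choice of the "inclusive" quantile method is an assumption; tools may compute percentiles slightly differently.

```python
import statistics

def coverage_iqr(depths):
    """Difference between the 75th and 25th percentiles of per-base depth."""
    q1, _median, q3 = statistics.quantiles(depths, n=4, method="inclusive")
    return q3 - q1

uniform = [28, 30, 29, 31, 30, 32, 29, 30]  # illustrative per-base depths
uneven = [5, 60, 12, 45, 2, 70, 30, 16]
print(coverage_iqr(uniform), coverage_iqr(uneven))
```

The uneven sample gives a much larger IQR even though both samples have a similar mean depth.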
Evaluating Coverage
• Mean (Mapped) Read Depth: The mean
mapped read depth (or mean read depth) is
the sum of the mapped read depths at each
reference base position, divided by the number
of known bases in the reference.
• Raw Read Depth: This is the total amount of
sequence data produced by the instrument
(pre-alignment), divided by the reference
genome size.
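Both depth measures are simple ratios, sketched here with invented numbers. Unknown reference bases are assumed to be written as N, and the function names are illustrative.

```python
def mean_mapped_depth(depths, reference):
    """Sum of per-position mapped depth divided by the number of known (non-N) bases."""
    if len(depths) != len(reference):
        raise ValueError("need one depth value per reference position")
    known_bases = sum(1 for base in reference if base.upper() != "N")
    return sum(depths) / known_bases

def raw_read_depth(total_bases_sequenced, genome_size):
    """Total sequence produced by the instrument divided by the reference genome size."""
    return total_bases_sequenced / genome_size

print(mean_mapped_depth([10, 12, 8, 10, 0, 0, 9, 11], "ACGTNNAC"))  # 10.0
print(raw_read_depth(600_000 * 250, 5_000_000))                     # 30.0
```

Raw depth is computed before alignment, so it is usually higher than the mean mapped depth when some reads fail to map.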
Insert Size
What is a SNP?
• Single Nucleotide Polymorphism (SNP)
ATGTTCCTC sequence
ATGTTGCTC reference
*phylogenetically informative differences
• Insertion or Deletion (Indel)
ATGTTCCCTC sequence
ATGTTC-CTC reference
*differences not used in hqSNP analysis
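The two examples above can be checked with a short sketch that walks along a pair of already-aligned sequences and separates SNPs from indels. It assumes gaps are written as "-" and that both sequences have equal aligned length; the function name is illustrative.

```python
def classify_differences(sequence, reference):
    """Return (snps, indels) for two aligned sequences; positions are 1-based."""
    if len(sequence) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    snps, indels = [], []
    for pos, (s, r) in enumerate(zip(sequence, reference), start=1):
        if s == r:
            continue
        if s == "-" or r == "-":
            indels.append(pos)          # gap on either side: insertion or deletion
        else:
            snps.append((pos, r, s))    # (position, reference base, sample base)
    return snps, indels

print(classify_differences("ATGTTCCTC", "ATGTTGCTC"))    # ([(6, 'G', 'C')], [])
print(classify_differences("ATGTTCCCTC", "ATGTTC-CTC"))  # ([], [7])
```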
Whole genome multilocus
sequence typing (MLST)
• Database is built from gene content representing a
diverse selection of the genus/species of the organism
being compared
• Each unique gene is referred to as a “locus” – a locus
may include the entire gene or a piece of the gene
• Any changes – SNP, insertions, deletions – equals a
new allele call for a locus
• New alleles are named sequentially when encountered-
not based on sequence
Locus 1:
allele 1: ACTAGAGGGAAA
allele 2: ACTAGAGGCTAA (2 SNPs)
allele 3: ACT-GAGGGTAA (1 indel / 1 SNP)
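Sequential allele naming can be sketched as a lookup table: the first time a sequence is seen at a locus it gets the next free number, regardless of how many changes it carries. The function name and the order of observations are assumptions; the sequences are the ones shown for Locus 1.

```python
def call_alleles(observed_sequences):
    """Assign allele numbers in order of first appearance at one locus."""
    allele_ids = {}
    calls = []
    for seq in observed_sequences:
        if seq not in allele_ids:
            allele_ids[seq] = len(allele_ids) + 1  # next sequential number
        calls.append(allele_ids[seq])
    return calls, allele_ids

observed = ["ACTAGAGGGAAA", "ACTAGAGGCTAA", "ACTAGAGGGAAA", "ACT-GAGGGTAA"]
calls, allele_ids = call_alleles(observed)
print(calls)  # [1, 2, 1, 3]
```

Note that allele 2 (2 SNPs) and allele 3 (1 indel, 1 SNP) are each simply "a new allele"; the number says nothing about how different the sequences are.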
Whole genome multilocus
sequence typing (MLST)
• Allows for simpler analysis and clear naming
of subtypes
• Performs comparison on a gene by gene
level
Locus | Isolate A | Isolate B | Isolate C
Locus 1 (20 nt) | 1 | 1 | 1
Locus 2 (100 nt) | 8 | 8 | 12
Locus 3 (5000 nt) | 5 | 5 | 2
Etc.
Locus 2,005 (5 nt) | 4 | 4 | 4
wgMLST type | A | A | B
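The gene-by-gene comparison in the table can be sketched as counting loci where two isolates have different allele numbers. Only the four loci shown are included, and the data structure and function name are illustrative assumptions.

```python
profiles = {
    "Isolate A": {"Locus 1": 1, "Locus 2": 8, "Locus 3": 5, "Locus 2,005": 4},
    "Isolate B": {"Locus 1": 1, "Locus 2": 8, "Locus 3": 5, "Locus 2,005": 4},
    "Isolate C": {"Locus 1": 1, "Locus 2": 12, "Locus 3": 2, "Locus 2,005": 4},
}

def allele_differences(profile_a, profile_b):
    """Number of shared loci at which two isolates carry different alleles."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(profile_a[locus] != profile_b[locus] for locus in shared_loci)

print(allele_differences(profiles["Isolate A"], profiles["Isolate B"]))  # 0
print(allele_differences(profiles["Isolate A"], profiles["Isolate C"]))  # 2
```

Isolates A and B match at every locus shown, consistent with both being assigned type A, while Isolate C differs at two loci.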
MLST Analysis
• Faster than analyzing SNP differences
• MLST is conducted twice
• Comparisons made between allele calls made with
short reads
• Comparisons made between allele calls made with
contigs
• Eliminates machine noise that is present with
either method when done independently
SNP versus MLST Analysis
• Both analyses conducted from raw data
• For public health purposes, both correlate well
• i.e., the outermost branches of phylogenetic trees
are almost identical
• The two are not mutually exclusive
• For some use cases MLST works better; for others,
SNP works better
Take Home Messages
• NGS needs time, resources, and expertise
• Need to understand
• when NGS should be done,
• what question you are trying to answer, and
• the metrics that go into a quality read
• SNP and MLST analysis can both be used
in public health and outbreak investigation
with largely similar results
Acknowledgments
• Centers for Disease Control and Prevention
• Greg Armstrong
• Peter Gerner-Smidt
• John Besser
• Martin Wiedmann
• Integrated Food Safety Centers of Excellence
Questions

 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 

CoE-WEBINAR-2_042117v3.pptx

  • 1. How Next Generation Sequencing (NGS) are Analyzed From Raw Data to Sequence Contigs Madhu Anand, DrPH NYSDOH Bureau of Communicable Disease Control E-mail: madhu.anand@health.ny.gov
  • 2. Objectives • Understand the process of obtaining a contiguous sequence • Describe the technology needed for conducting NGS • Understand differences between conducting NGS for viral vs bacterial vs eukaryotic organisms • Explain the difference between reference and de-novo based assemblies • Understand the quantitative measures used to assess the quality of sequencing data. • Differences between SNP and MLST analysis
  • 3. Overall NGS Process • Specimen • Broken up into fragments • From the fragments develop sequence reads • Sequence reads are assembled into contiguous reads • Contiguous reads are put together in order • These are then compared 3
  • 4. Why It is called NGS • Sanger Sequencing (First generation) • Most widely used sequencing technology for approximately 25 years • Next Generation Sequencing • “AKA” high-throughput sequencing • Includes most sequencing technologies that came after Sanger sequencing • Technologies • Second Generation (Short-read) • Illumina • Ion Torrent • 454 pyrosequencing (Legacy Technology) • Third Generation (Long-read) • Pac-Bio • Oxford Nanopore 4
  • 5. Sanger vs NGS • Both allow for whole genome sequencing • NGS allows for massively parallel sequencing of target genes • Both technology standards have their utility in genomics today • Cost • Sequence read length • Massively Parallel • Millions of fragments are sequenced in a single run vs one forward and reverse read 5
  • 6. Platform Characteristics
      Platform | Read length (bp) | Isolates per run (max) | Run time | Instrument cost | Cost/Mb
      Illumina HiSeq 2500 | 150 | 600-1000 | 5-11 d | $740K | $0.05
      Illumina MiSeq | 150, 250, 300 | 12-16 | 26 h, 36 h, 65 h | $99K | $1.37
      Illumina NextSeq | 75, 150 | 96 | 29 h | $250K | $0.03-$0.07
      IonTorrent PGM (314, 316, 318) | 200, 400 | 1-10 | 2-8 h | $75K | $0.93-$7.5
      Ion Proton | 100-200 | 96 | 2-4 h | $245K | $0.02
      PacBio RSII | 10,000-40,000 | 8 per SMRT cell | 0.5-2 h | $750K | $180.00
      Sanger | 650 | 96 | 1 h | $100K | $2800
      Source: Romer-Carleton
  • 7. Platforms and Pathogens • Read length • Size of genome being sequenced 7
  • 8. Viral Genome • Compact • ~10,000 nucleotides (nt) (typically, ~3,000–200,000) • Little wasted space • Variable composition • DNA; RNA • Single-stranded; double stranded • Linear; circular • Single; segmented • Often highly variable • Particularly true of ssRNA viruses • Quasispecies. Example: hepatitis C virus
  • 10. Bacterial Genome • Larger • ~5m nt (typically, ~2m–10m) • Pan-genome • Core genome: ~3,000–5,000 genes, present in most strains of a given species • Accessory genome: up to thousands of genes, not always present • Structure • dsDNA • Usually single, circular chromosome
  • 11. Bacterial Genome Complications • Complications • Plasmids • Circular, dsDNA structures that replicate independently from chromosome • Often carry resistance or virulence genes • Can be passed from one bacterium to another • Phages • Viruses that infect bacteria • Genome can integrate into chromosome • Other repetitive elements
  • 13. Eukaryotes • Huge • Billions of nucleotides long • Human genome: ~3B nt • Structure • dsDNA • Multiple chromosomes • Most of human genome has no apparent function • Introns: (intragenic region) non-coding segments (~99%) • Exons: (expressed region) coding segments (~1%)
  • 14. Viral vs. Bacterial vs. Eukaryotic Genome Summary • Viral genomes: compact, efficient organization. Size: 1,000s of nt. • Bacterial genomes: more complex, repetitive elements. Size: 1,000,000s of nt. • Eukaryotic genomes: mostly introns, 1,000,000,000s of nt.
  • 15. AMD – Innovate * Transform * Protect Relative Genome Size [Figure: chart of relative genome size (haploid bp) on a scale from 1 kb to 100 Gb for viruses, plasmids, archaea, bacteria, fungi, protists, plants, algae, nematodes, insects, mollusks, fish, amphibians, reptiles, birds, and mammals; 3 Gb is marked]
  • 16. Mobile Genetic Elements • Are sequences that can move around within and between genomes. • The “mobilome” includes: • Plasmids • Transposons • Bacteriophages • In prokaryotes, there is horizontal gene transfer between organisms. • Mobile genetic elements enable horizontal gene transfer, and consequently are an important factor in acquired virulence, antimicrobial resistance and bacterial evolution. https://en.wikipedia.org/wiki/Mobile_genetic_elements Methods Mol Biol. 2009;532:13-27. doi: 10.1007/978-1-60327-853-9_2
  • 17. Mobile Genetic Components • Allow two organisms to become different very quickly • Resistance • Might have millions more nucleotides • Interpretation of DNA becomes difficult • Need to take these components into consideration • Do you keep them in the analysis?
  • 18. Plasmids • Small (1kb-200kb) extrachromosomal sequences that are circular and self-replicating. Varying copy number. • Present in both bacteria and eukaryotic cells. • Five functional categories of plasmids: • Fertility, Resistance, Col-plasmid, degradative, virulence. • Compatibility: similar plasmids usually can’t coexist. • Plasmids are an important mechanism of horizontal gene transfer, and play a critical role in bacterial diversity. • HGT: AR, virulence, metabolism, other acquired features. • Used as a cloning vector for research • Can integrate/recombine with the host genome. Bourgogne 2003 doi: 10.1128/IAI.71.5.2736-2743.2003 Methods Mol Biol. 2009;532:13-27. doi: 10.1007/978-1-60327-853-9_2
  • 19. Bacteriophage • Bacteriophage are viruses that infect bacteria • Extremely diverse group of viruses • 5kb to 500kb ss/dsDNA or ss/dsRNA genome • Inject payload into the target bacteria, hijack its replicative machinery to reproduce. • Lysogenic phages: • Lysogenic phages do not cause immediate lysis of the cell, but instead incorporate into the host genome. • Viral payload may remain dormant as a prophage. This prophage may alter the phenotype of the cell in important ways (eg: p0157) • Excision of the phage is also an imperfect process and may include adjacent sequences from the host genome that can be “transmitted” by the phage to subsequent hosts.
  • 20. Transposons • A transposon, or transposable genetic element, can mediate its own movement within the genome or between hosts. • Common feature of most genomes, not just bacteria. • Encodes a transposase, which cuts the donor and target sequences, and performs strand exchange. • Targeting can be fairly indiscriminate – may knock out gene function. This property is actually used as a research tool. [Figure: Tn10, Salmonella typhi; 9147 bp; encodes TetR; non-replicative cut and paste; 5’ and 3’ ends include IS and a pair of putative transposons]
  • 21. Sequencing Process Definitions • Sequence – generic name describing the order of biological letters (DNA/RNA or amino acids). • Both reads and contigs are DNA/RNA or amino acid sequences • Reads – the stretches of sequenced bases produced by the instrument, which are the pieces you are trying to assemble • Contigs – reads that have been assembled together; final product
  • 22. Why do we need to assemble? • Bacterial genomes range in size from about 3 million to more than 5 million nucleotides (A, C, T, and G). • Genomes are broken into pieces of about 250 nucleotides to sequence • New technologies can read >10,000 nucleotides at once • A large percentage of the sequence encodes proteins (these regions are called genes) • Those pieces then need to be aligned so they can be compared to each other
  • 23. Making an assembly • Two different methods • Reference guided – When you have a “map” of the final product, and try to match your pieces together to look like the final product • Jigsaw puzzle – you have the picture and put the pieces together • Lyrics of a song – you know the words to the song, and put the words in the correct order • De Novo – Put pieces together by what “makes sense” • Jigsaw puzzle – you may not have the picture, but can put pieces together by what fits together • Lyrics of a song – you may not know the words of the song, but can figure out sentences from the words.
  • 24. Example: De Novo Assembly • De novo assembly of lyrics, using “reads” of 4 to 6 words (each word is a base pair/amino acid) • Reads:
      1: yeah there will be an
      2: tomorrow let it be o will
      3: let it be and though they
      4: me speaking words of wisdom
      5: let it be let
      6: the night is cloudy there
      7: be let it be let it
      8: be whisper words of wisdom
      9: on me shine until tomorrow
      10: let it be let it be let
      11: be and when the broken hearted
      12: answer let it be and though
      13: it be let it be
      14: and though the night is
      15: the broken hearted people living
  • 25. Assembly “Pile-Up” [Figure: aligned reads stacked in overlapping rows above the assembled contig they support; labels: aligned reads, assembled contig, coverage = 3, coverage = 4; note errors in some reads, e.g. “wads of wisdom” and “speaking warts of”] • Assembled contigs shown: “when I find myself in times of trouble Mother Mary comes to me speaking words of…”, “… speaking words of wisdom let it be”, “…wisdom let it be and in my hour of darkness she is standing right in front of me…”
  • 26. The Problem of Repeats [Figure: three alternative pile-ups of “let it be” reads between “… of wisdom” and “speaking words …”, each marked with “?”] • Candidate assemblies shown: “… of wisdom let it be let it be let it be speaking words …”, “… of wisdom let it be let it be let it be let it be speaking words …”, “… of wisdom let it be let it be let it be let it be let it be speaking words …”
  • 27. Final Assembly • 3 contigs: when I find myself in times of trouble Mother Mary comes to me speaking words of wisdom let it be and in my hour of darkness she is standing right in front of me speaking words of wisdom and though the night is cloudy there is still a light that shines on me shine until tomorrow let it be o will I make up to the sound of music Mother Mary comes to me speaking words of wisdom whisper words of wisdom let it be and when the broken hearted people living in the world agree there will be an answer let it be and though they may be parted there is still a chance that they will see there will be an answer • Many unused reads: be let it be let it be let it it be let it be let it be let it be let be let it be let be let it be it be let it be let
  • 28. Reference-Guided (Mapped) Assembly [Figure: reads mapped onto a reference sequence/genome, forming Contig 1 and Contig 2 with a region of low sequence coverage; coverage scale from 18X to 1X] • UNMAPPED READS: 1. Sequences not present in the reference. 2. Plasmids or other extrachromosomal. 3. DNA Structural Variation/Rearrangement • ADVANTAGES: Relatively fast, well-suited to highly-conserved genomes. • DISADVANTAGES: Issues with high diversity, mobile elements • Example software: BWA (https://github.com/lh3/bwa) breseq (https://github.com/barricklab/breseq)
  • 29. De Novo Assembly [Figure: reads assembled into Contig 1 through Contig 7, with a PLASMID shown separately] • ADVANTAGES: Reference agnostic: assembles all the reads it can. Various algorithms. • DISADVANTAGES: Doesn’t always get things right. Particularly with complex repeats. • Example software: SPAdes (http://bioinf.spbau.ru/spades) List: https://en.wikipedia.org/wiki/Sequence_assembly#Available_assemblers
  • 30. De Novo Assembly: Contigs [Figure: Campylobacter assembly; 29 contigs; 1.61 Mbases; N50: 151 kbases]
  • 31. Improving and Assessing Assembly Quality • Assessing assemblies • N50 – Sort the contigs from longest to shortest and add up their lengths until the running total reaches at least 50% of the assembly; N50 is the length of the last (smallest) contig added. Still used but should not be the only method. Other metrics - L50, NG50, N90. • Contigs (largest, smallest, mean, median, size distribution) • Coverage Plots – low coverage areas, high coverage areas (collapsed repeats) • Quast – collection of assembly assessment tools with detailed reports (misassembly detection, structural variation) http://bioinformatics.oxfordjournals.org/content/29/8/1072.full.pdf • GAGE – assessment tool used by annual Assemblathon http://genome.cshlp.org/content/early/2012/01/12/gr.131383.111.full.pdf • Improving assembly • Higher Coverage • Paired end instead of unpaired reads • Longer read length • Read quality • Assemblers
  • 32. Longer vs Shorter Reads • Read length depends on instruments used • Depending on goal, one may be better than other • Shorter reads • Pro – economic • Con – repeats, missing areas • Long reads • Pro - provide better coverage and higher consensus • Con - expensive
  • 33. Benefit of Longer Reads • 33-word read:* front of me speaking words of wisdom let it be let it be let it be let it be let it be whisper words of wisdom let it be and when the broken • Poor quality, 33-word read:* font of me sporking wit of wisdom let it see let it be leg if be let is but let it be whimper words of doom set it be and then the broken • * analogous to long-read sequencing, such as PacBio or nanopore sequencing
  • 34. NGS Quality Control Metrics • Q-score or Phred Quality Score - A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing • Coverage- The average number of reads representing a given nucleotide in the reconstructed sequence. • Insert Size - is the length of the DNA (or RNA) that you want to sequence and that is "inserted" between the adapters (so adapters excluded). 34
  • 35. The Alphabet Soup of Analysis – Q-score • Quality scores - likelihood the base call is correct • Phred – part of the FASTQ file generated by the sequencer that scores base call quality • Q30 – the percentage of base calls that have a 1 in 1000 chance or less of being incorrect (Q20 – 1 incorrect in 100 base calls) • Indicates whether a base call is trustworthy and can be used in a hqSNP analysis
  • 36. Coverage • Inter-Quartile Range (IQR): The IQR is the difference in sequencing coverage between the 75th and 25th percentiles of the histogram. A high IQR indicates high variation in coverage across the genome, while a low IQR reflects more uniform sequence coverage. *Adapted from Illumina.com
  • 37. Evaluating Coverage • Mean (Mapped) Read Depth: The mean mapped read depth (or mean read depth) is the sum of the mapped read depths at each reference base position, divided by the number of known bases in the reference. • Raw Read Depth: This is the total amount of sequence data produced by the instrument (pre-alignment), divided by the reference genome size. 37
  • 39. What is a SNP?
      • Single Nucleotide Polymorphism (SNP)
          ATGTTCCTC   sequence
          ATGTTGCTC   reference
          *phylogenetically informative differences
      • Insertion or Deletion (Indel)
          ATGTTCCCTC  sequence
          ATGTTC-CTC  reference
          *differences not used in hqSNP analysis
  • 40. Whole genome multilocus sequence typing (MLST) • Database is built from gene content representing a diverse selection of the genus/species of the organism being compared • Each unique gene is referred to as a “locus” – a locus may include the entire gene or a piece of the gene • Any changes – SNP, insertions, deletions – equals a new allele call for a locus • New alleles are named sequentially when encountered, not based on sequence
      Locus 1:
          ACTAGAGGGAAA   allele 1
          ACTAGAGGCTAA   allele 2 (2 SNPs)
          ACT-GAGGGTAA   allele 3 (1 indel / 1 SNP)
  • 41. Whole genome multilocus sequence typing (MLST) • Allows for simpler analysis and clear naming of subtypes • Performs comparison on a gene by gene level
      Locus | Isolate A | Isolate B | Isolate C
      Locus 1 (20 nt) | 1 | 1 | 1
      Locus 2 (100 nt) | 8 | 8 | 12
      Locus 3 (5000 nt) | 5 | 5 | 2
      Etc.
      Locus 2,005 (5 nt) | 4 | 4 | 4
      wgMLST type | A | A | B
  • 42. MLST Analysis • Faster than analyzing SNP differences • MLST is conducted twice • Comparisons made between allele calls made with short reads • Comparisons made between allele calls made with contigs • Eliminates machine noise that is present with either method when done independently 42
  • 43. SNP versus MLST Analysis • Both analyses conducted from raw data • For public health purposes, both correlate well • i.e the outermost branches of phylogenetic trees are almost identical • The two are not mutually exclusive • For some use cases MLST works better, others SNP works better 43
  • 44. Take Home Messages • NGS needs time, resources, and expertise • Need to understand • when NGS should be done, • what question you are trying to answer, and • the metrics that go into a quality read • SNP and MLST analysis can both be used in public health and outbreak investigation with largely similar results 44
  • 45. Acknowledgments • Centers for Disease Control and Prevention • Greg Armstrong • Peter Gerner-Shmidt • John Besser • Martin Wiedmann • Integrated Food Safety Centers of Excellence 45
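
The quality measures on slides 31, 35, and 37 can be sketched in a few lines of Python. This is a minimal illustration, not the code of any tool named in the slides: the function names are made up for this example, the N50 routine follows the slide 31 definition, and the Phred conversion uses the standard relation P = 10^(-Q/10), which gives the 1-in-1000 (Q30) and 1-in-100 (Q20) figures quoted on slide 35.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches 50% of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:  # at least half of the assembly covered
            return length
    raise AssertionError("unreachable: running total always reaches total")


def phred_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def raw_read_depth(total_bases_sequenced: int, genome_size: int) -> float:
    """Raw read depth: total sequence produced divided by genome size."""
    return total_bases_sequenced / genome_size


if __name__ == "__main__":
    # Hypothetical contig lengths, chosen only to show the calculation.
    print(n50([100, 80, 50, 30, 20]))
    print(phred_error_probability(30), phred_error_probability(20))
    # Hypothetical run: 150 Mb of sequence for a 5 Mb bacterial genome.
    print(raw_read_depth(150_000_000, 5_000_000))
```

Mean mapped read depth (slide 37) would need per-position depths from an alignment, so it is left out of this sketch.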

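
The SNP/indel distinction on slide 39 and the sequential allele naming on slide 40 can also be illustrated with a short Python sketch. It assumes the sequences are already aligned (same length, with "-" marking a gap), counts each gap column as one indel, and uses a hypothetical `AlleleDatabase` class invented for this example; real wgMLST software works against curated allele databases and is considerably more involved.

```python
from typing import Dict, Tuple


def count_differences(seq: str, ref: str) -> Tuple[int, int]:
    """Return (snps, indel_columns) between two pre-aligned sequences."""
    if len(seq) != len(ref):
        raise ValueError("sequences must be aligned to the same length")
    snps = 0
    indels = 0
    for a, b in zip(seq, ref):
        if a == b:
            continue
        if a == "-" or b == "-":
            indels += 1  # a gap column: insertion or deletion
        else:
            snps += 1  # a substituted base
    return snps, indels


class AlleleDatabase:
    """Hypothetical allele store: numbers new alleles in order of discovery."""

    def __init__(self) -> None:
        self._alleles: Dict[str, Dict[str, int]] = {}

    def call(self, locus: str, sequence: str) -> int:
        """Return the allele number for a sequence, naming it if new."""
        known = self._alleles.setdefault(locus, {})
        ungapped = sequence.replace("-", "")  # store without alignment gaps
        if ungapped not in known:
            known[ungapped] = len(known) + 1  # any change gives a new allele
        return known[ungapped]


if __name__ == "__main__":
    db = AlleleDatabase()
    # The three Locus 1 sequences from slide 40.
    for s in ("ACTAGAGGGAAA", "ACTAGAGGCTAA", "ACT-GAGGGTAA"):
        print(s, "allele", db.call("Locus 1", s),
              count_differences(s, "ACTAGAGGGAAA"))
```

The allele number carries no information about how different two alleles are, which is why slide 40 notes that alleles are named sequentially rather than by sequence.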
Editor's Notes

  1. Good afternoon, my name is Madhu Anand and I will be presenting on how NGS are analyzed from raw data to sequence contigs. Before I start presenting, I wanted to thank CDC, Martin Wiedmann, and my CoE colleagues for a number of the slides I will be presenting today.
  2. After this presentation, we hope that you will understand the process of obtaining a contiguous sequence. We also hope that you will appreciate the varied methods of conducting next generation sequencing (NGS), and understand how there are differences between conducting NGS for viral vs bacterial pathogens. As we describe how to put contigs together, we will talk about how to do that in 2 different ways: through reference and de novo assemblies. Finally, although we will not go into the nitty gritty of quantitative measures, you should understand that there are quantitative measures to assess the quality of sequencing data.
  3. In the first webinar it was discussed what the NGS process is overall, and I wanted to quickly review. You have an isolate that undergoes some agitation and is broken up into fragments. From these fragments you develop what are called sequence reads. The sequence reads are then assembled into contiguous reads. Contiguous reads can then be put together in order. Technically this is called a scaffold. And finally, they are compared to other contiguous reads that are of known etiologies (or sometimes just to see if they are the same or different without knowing the etiology).
  4. As an epidemiologist at the state level, I first started hearing of Next generation sequencing and Whole genome sequencing about 5 years ago. Local Health Departments in the area began asking me what the terms meant. It’s not until recently that I started to get a better handle on the nuances. The “father or mother” of sequencing is called “Sanger sequencing”, meaning it is the first generation. It is a DNA sequencing technology that has been used for approximately 25 years. This has given way to NGS (Next Generation Sequencing), also known as high-throughput sequencing. NGS is an umbrella that includes many different platforms. These platforms reside under either the short-read variety or the long-read variety. We usually consider the short-reads to be the second generation, and the long-reads to be the third generation. Under the second generation we have three different platforms: (1) the platforms that have been introduced by Illumina, (2) the ion torrent platform, and (3) 454 pyrosequencing. In the third generation we have two different platforms: (1) the Pacific-Bio and (2) the Oxford Nanopore
  5. Both Sanger and NGS allow for whole genome sequencing….they are just different technologies for doing it. There are several considerations when deciding with which platform to sequence. All technologies can do the job, but there is a balance based on a variety of factors. Cost is a factor when you start looking at switching platforms. Because Sanger has been well adopted, there is less start up cost than if you start incorporating a new technology. With Sanger sequencing, you can sequence a small amount, but the quality is much higher. The sequence read length is another differentiator between Sanger sequencing and NGS for most platforms. The Pacific Bio or the Oxford Nanopore both produce long reads, but they come at a price. With Sanger sequencing you can obtain reads of 700-800 nt, which are longer reads than Illumina, but shorter than Pac Bio. With NGS you have massively parallel sequencing occurring at one specific moment. Millions of fragments are sequenced in a single run, whereas with Sanger sequencing only one forward and one reverse read occur at a given moment.
  6. From this illustration you can see that depending on the platform, there is great variation in read length, volume of isolates that can be run at one time, and cost. Depending on the question that needs answering, and the pathogen being tested, one of these platforms may be preferred over another. Epidemiologists should understand which platform their lab is using as NGS is adopted more readily in order to understand the limitations of what their laboratory can provide. I think epidemiologists ask their lab to get testing done for investigations (does not have to be for NGS), sometimes without really understanding the turnaround time or the quality of what they might be getting….open communication between the two can be extremely helpful in setting up a program in your department…it can set expectations on how often specimens may be run, guide follow up questions for an investigation, and develop a workflow. For example, at the NYSDOH Wadsworth Center Laboratories, they have a NextSeq where they are running all Salmonella, but need to wait until we have 96 isolates for a run. Previously they were using a MiSeq, which could be filled up faster and run more often…but was more expensive per read. As epidemiologists, we have a better understanding of how the platform may impact turnaround time depending on the season. Also, costs and adoption of technology change very quickly, so what may be expensive today may not be tomorrow.
  7. Additional considerations for platforms are what pathogens are being sequenced….it is important to know the read length the platform can handle in relation to the size of the genome being sequenced
  8. Viral genomes are relatively compact at about 10 thousand nucleotides, and have little wasted space. They are variable in composition, meaning they can consist of DNA or RNA, can be single stranded or double stranded, be linear or circular, or be single or segmented. They are often highly variable, meaning there are a lot of SNPs between viruses – they evolve so fast that you can get different strains very quickly. This is particularly true of single stranded RNA viruses.
  9. Here is an example of a viral genome – Hepatitis B. There are a variety of genes in a circular fashion. Each gene is a different length, the genes are coded in different directions, and multiple segments make up the genome.
  10. Bacterial genomes, on the other hand, are larger, usually about 5 million nucleotides, and have a highly conserved core portion of about 3-5 thousand genes that are present in most strains of a given species. There is also an accessory genome, which is made of up to thousands of genes that are not always present. So what are accessory genes’ functions? They can still be very important, for instance in antibiotic resistance such as beta-lactam resistance. The structure of the bacterial genome is double stranded DNA that is usually a single circular chromosome.
  11. Complications occur in bacterial genomes with plasmids and phages. Plasmids are circular double stranded DNA that can replicate independently from the chromosome. They often carry resistance or virulence genes and can be passed from one bacterium to another. Phages are viruses that infect bacteria and the genome can integrate in the chromosome. There are also other repetitive elements that are hard to analyze because we don’t necessarily know where they go or how many times they are repeated. They are important because they can cause difficulty in putting together an assembly.
  12. Here is an illustration of an E. coli genome. Phages and plasmids are mobile elements that are not typically part of the core.
  13. Eukaryotes on the other hand are huge - they are billions of nucleotides large. The human genome is 3 billion nucleotides, so this gives you some perspective. The structure of eukaryotes is a double stranded DNA with multiple chromosomes. Most of the human genome has no apparent function, because it is composed predominantly of non-coding segments (introns), rather than coding segments (exons).
  14. In general, these are the differences in size between the 3. Viral genomes are compact with efficient organization and are the smallest. Bacterial genomes are more complex with repetitive elements and are larger. Eukaryotic genomes are huge and mostly introns.
  15. This illustration gives you an idea of relative genome size for a number of entities. Humans fall at about 3 billion nucleotides (Listeria is about 3 million).
  16. I am going to talk for a few minutes about mobile genetic elements. Mobile genetic elements move around within and between genomes…..they enable horizontal gene transfer and consequently are an important factor in acquired virulence, antimicrobial resistance and bacterial evolution.
  17. Mobile Genetic Elements are important for epidemiologists to understand as they can allow two organisms to become different very quickly and cause resistance. Because they can add from a few thousand nucleotides to potentially hundreds of thousands of nucleotides to a genome, the interpretation of DNA becomes difficult. Some plasmids can be more than a hundred thousand nucleotides long. For analysis purposes you need to take these components into consideration and determine whether to keep them in the final analysis.
  18. I am going to talk about some specific mobile genetic elements, the first being plasmids. Plasmids are small extrachromosomal sequences that are circular and self-replicating, with a varying copy number. They are present in both bacterial and eukaryotic cells. They have a number of functions, some of which are more important for outbreak investigations, such as resistance and virulence. Similar plasmids usually cannot co-exist in the same cell. Plasmids play an important role in horizontal gene transfer and a critical role in bacterial diversity.
  19. Bacteriophages are another type of mobile genetic element. They are viruses that infect bacteria, and they are extremely diverse. Their genomes can be single-stranded or double-stranded DNA or RNA. They inject this DNA/RNA into the target bacterium and take control of it to reproduce. Lysogenic phages differ in that they do not cause immediate destruction of the cell, but instead integrate into the host genome. In this instance, the phage genome may remain in a dormant state but still alter the organism.
  20. Transposons are the final mobile genetic element I am going to speak of. They can mediate their own movement within the genome or between hosts. Transposons are a common feature of most genomes, not just bacterial ones. They encode an enzyme that cuts the genome at certain sites, allowing the transposon to move and insert itself in a different area of the genome. By doing this, they may knock out a gene's function. This process is actually used for research purposes.
  21. I have talked a bit about the machines needed to do WGS, and how there might be elements within a raw specimen that need to be taken into consideration for analysis purposes. In order to understand how an isolate gets to be analyzed, we needed to talk about the machinery and the DNA/RNA differences of the raw product. But I have not talked about how you put all the pieces together. Before I delve in, let's discuss definitions that are specific to the sequencing process. A sequence is the generic name for the order of biological letters in DNA, RNA, or amino acids. You will hear the terms reads and contigs many times through this presentation. They are both sequences. Reads are the short stretches of base pairs produced by the sequencer, which are the pieces you try to assemble. Contigs are reads that have been assembled together into a final product. Contig is a term you will see a lot later on, and it simply means a contiguous sequence of DNA.
  22. Why do we need to assemble in the first place? Bacterial genomes are millions of nucleotides long: those A, C, T, and Gs I remember from college biology. These genomes need to be broken into smaller pieces in order to process and sequence them. The current technology sequences pieces of about 250 nucleotides. However, technology is evolving very quickly, and newer technologies can read more than 10 thousand nucleotides at once. A large percentage of the sequence encodes proteins. The areas where proteins are encoded are called genes. These areas need to be aligned so they can be compared to each other.
  23. There are two different methods for putting an assembly together. The first is reference guided and is a little easier to explain. Imagine you have a map or a picture of the final product and you are trying to match your pieces together to look like the final product. Let me provide some concrete examples. You are putting together a jigsaw puzzle: you have the picture of what the finished puzzle will look like, and you are putting the pieces together based on the colors and shapes in that picture. Another example is that you know the words and composition of a song and are able to put the words of the song into the correct order. The second way of making an assembly is de novo, where you put the pieces together by what makes sense. Let's take the previous examples. For the jigsaw puzzle, imagine you have lost the picture of the final product. Are you still able to put the jigsaw puzzle together? Yes, by looking at the shapes of the pieces and seeing what fits together. Likewise for the lyrics of a song: you may not know all, or any, of the words of the song, but you can figure out sentences from the words that you have been given.
  24. Let's walk through a de novo assembly of some lyrics. In this example, each word represents a base pair. We put together 4-6 words into something that makes sense.
  25. We then try to line up all the reads so the words are lining up with each other. When we have them lined up, we can put together a final product, which is the assembled contig. I want to mention one more thing: coverage is also important. As you can see in this middle section, the word right aligns three times out of 5, and the word standing aligns four times out of 5. A contig developed from reads with higher coverage gives you higher confidence that the contig was assembled correctly. However, when you put a contig together de novo, there is some inherent potential for errors in the reads to carry through into the assembly.
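To make the lyric analogy a bit more concrete, here is a minimal Python sketch of a greedy overlap assembly where each word stands in for a base. It is only an illustration: the sentence, the function names (`overlap`, `greedy_assemble`, `word_coverage`) and the minimum overlap of two words are assumptions made for this example, and real assemblers use far more sophisticated algorithms.

```python
def overlap(left, right, min_overlap=2):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def greedy_assemble(reads, min_overlap=2):
    """Repeatedly merge the two pieces with the largest overlap."""
    pool = [read.split() for read in reads]
    while len(pool) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    size = overlap(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: the pieces stay separate contigs
        merged = pool[best_i] + pool[best_j][best_size:]
        pool = [p for k, p in enumerate(pool) if k not in (best_i, best_j)]
        pool.append(merged)
    return [" ".join(piece) for piece in pool]


def word_coverage(contig, reads):
    """Count how many reads contain each word.

    This shortcut only works because every word in the toy sentence is unique.
    """
    return {word: sum(word in read.split() for read in reads)
            for word in contig.split()}


# Hypothetical "reads" of five words each, given in scrambled order.
reads = [
    "bacterial genomes to detect outbreaks",
    "public health labs sequence bacterial",
    "genomes to detect outbreaks early",
    "labs sequence bacterial genomes to",
]
contigs = greedy_assemble(reads)
coverage = word_coverage(contigs[0], reads)
```

In this toy run the four reads collapse into a single contig, and words in the middle of the sentence are covered by more reads than words at the ends, which is the same pattern as in the lyric slide.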
  26. There are also issues where multiple areas in a song have the same lyrics, and you need to figure out where to assemble them and how many times. In this example, the words let it be repeat many times throughout the song. The number of times they should be put together may be confusing: is it just once, or two, or three times?
  27. Once you put together the final assembly, you also might have areas of the song where you didn't have enough confidence that they belonged (where you don't have good coverage), so you would not include them. Scaffold
  28. When you conduct a reference guided assembly (for example, when you know that you are looking at a certain Salmonella serotype), you might find some unmapped reads. These may occur for a variety of reasons: the sequences are not actually in the reference; plasmids may be present that are extrachromosomal and not part of the chromosome; or the DNA may be a variant or have rearranged. The advantage of conducting a reference guided assembly is that it is relatively fast and well suited to highly conserved genomes with little change. The disadvantages are that there can be issues if the genome is highly diverse, and that lots of mobile elements may lead to unmapped reads.
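As a rough illustration of where unmapped reads come from, here is a minimal Python sketch of reference guided mapping. It assumes exact string matching only, and the reference, the reads, and the function name `map_reads` are made up for this example. Real aligners tolerate mismatches and gaps, so treat this as a cartoon of the idea.

```python
def map_reads(reference, reads):
    """Place each read on the reference by exact match; collect the rest."""
    mapped = {}
    unmapped = []
    for read in reads:
        position = reference.find(read)  # -1 means the read is not in the reference
        if position >= 0:
            mapped[read] = position
        else:
            unmapped.append(read)
    return mapped, unmapped


# Hypothetical reference and reads; the last read stands in for plasmid DNA
# that is simply not present in the reference.
reference = "ATGGCGTACGTTAGCCTGA"
reads = ["GCGTACG", "TTAGCC", "GGGAAATTT"]
mapped, unmapped = map_reads(reference, reads)
```

The first two reads land at known positions on the reference, while the third is left over, just as reads from a plasmid or a rearranged region would be.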
  29. This is an example for when you conduct a DeNovo assembly – for example when you are not sure what pathogen you are looking at. Advantages include that DeNovo assembles all the reads it can and doesn’t eliminate any based on a reference. I won’t go into details but there are many algorithms out there to be able to do this. The disadvantages include that you do not always get things right, particularly with complex repeats as we saw earlier.
  30. In order to view the sequence information, the lab and bioinformaticians use computer programs. I have put up an example of how putting sequence reads together may look.
  31. I wanted to spend a few minutes speaking about how to assess assembly and quality. As an epidemiologist, I have confidence that the clusters that are reported from the laboratory are based on sound methodologies. But what I did not know is that there are tools they can use to give them higher confidence in the accuracy of the reads and contigs that have been assembled. I won’t go into all of them, but they have measurements and metrics to assist them in making assemblies based on the size of these contigs, the coverage of the reads (especially in the repeat areas). There are also ways to improve assembly which include having higher coverage, having paired instead of unpaired reads, having a longer read length, ensuring read quality, and utilizing assemblers.
  32. Specific to read length: as initially discussed in this presentation, read length depends on the machines/instruments used. Depending on the goal of assembling (for example, whether to find clusters vs to identify the assembly), one may be better than the other. Shorter reads are usually more economical, but may leave more unresolved repeats and missing areas. Long reads give better coverage of those areas, but the technology is more expensive. Long reads are typically not used in public health sequencing; they are only used for niche purposes, such as generating high quality assemblies for validating sequencing protocols. The reason we don't use long read sequencing for outbreak surveillance of foodborne pathogens is the cost and the comparatively high error rate. The error rate might be overcome by increasing coverage, which in turn would make it even more expensive. For most labs, the type of sequencing that is currently relevant is short read sequencing, and the MiSeq is the instrument most labs are using to achieve this at this point in time.
  33. This is an example of a 33-word read. With high quality, the lyric words and their order are correct; this is like Illumina sequencing, which gives a good ability to assemble. With poorer quality (but still with long reads), you start to see misspellings in the words, but you still have the general feel and get the order correct. While this looks like it might give you poorer quality, long read sequencing gives you a better ability to put sequence reads in the correct order. This is where PacBio might come in. Take a rare Salmonella (e.g., Salmonella Livingstone) for which you might not have a good sequence already run. You can use PacBio to get better coverage, and then use that PacBio genome as a reference against an Illumina run where you can call SNPs. The benefit of long reads is that you get the order correct.
  34. Let's discuss some QC metrics that are associated with the NGS instruments. One of the QC metrics that you hear about the most is the Q-score, or Phred quality score. A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing. This is something that is universal across all the platforms. Coverage is another QC metric. The coverage is the average number of reads representing a given nucleotide in the reconstructed sequence. So after the sequences have been aligned to a reference genome, or assembled without a reference genome (known as de novo assembly), you'll be able to see how many redundant reads there are for a specific base. We call that the coverage. If the coverage is minimal, then each read carries more weight when it comes to calling that specific base, and you have less confidence in the call. If the coverage is very high and the reads say adenine, there is a high probability that adenine is the correct call. Another QC metric is the insert size, which is the length of the DNA that you want to sequence and that is "inserted" between the adapters.
  35. One of the other tools available is the quality score. This is the likelihood that a base call is correct. Quality scores are commonly reported as Phred scores, where the sequencer scores base call quality based on a control. A score of QX means that a base call has a 1 in X chance or less of being incorrect, on a logarithmic scale. For example, Q20 is 1 incorrect in 100 base calls, and Q30 is 1 incorrect in 1000 base calls. Essentially this helps to determine whether a base call is trustworthy and can be used in a high quality SNP (single nucleotide polymorphism) analysis.
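The Q20 and Q30 examples follow from the standard Phred relationship, Q = -10 × log10(P), where P is the probability that the base call is wrong. Here is a minimal Python sketch of that conversion. The function names are my own, and the decoding step assumes the common Phred+33 FASTQ encoding, which may not match every instrument or file.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string into Phred scores.

    Assumes Phred+33 encoding (ASCII code minus 33); this is an assumption.
    """
    return [ord(character) - offset for character in quality_string]


# Hypothetical quality string for a four-base read.
scores = decode_fastq_qualities("II5?")
low_quality_positions = [i for i, q in enumerate(scores) if q < 30]
```

With this encoding, a lab could flag the positions whose score falls below a chosen threshold (Q30 here) before those bases are used in a SNP analysis.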
  36. Next let's describe coverage in a little bit more detail. Let's start with the Inter-Quartile Range (IQR). The IQR is the difference in sequencing coverage between the 75th and 25th percentiles of the coverage histogram. A high IQR indicates high variation in coverage across the genome, while a low IQR reflects more uniform sequence coverage. In the diagram below, you can see the two different scenarios. On the left you have a smaller IQR, whereas on the right you have a larger, more spread out IQR. The left is ideally what you would like to see across your reads. The right shows a little too much variation, or a high IQR, which is less desirable.
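As a small worked example of the IQR idea, here is a Python sketch that computes the IQR of per-position coverage using the standard library. The depth values and the function name `coverage_iqr` are invented for illustration, and the exact percentile method used by any particular QC tool may differ.

```python
import statistics


def coverage_iqr(depths):
    """Difference between the 75th and 25th percentiles of coverage."""
    q1, _median, q3 = statistics.quantiles(depths, n=4)
    return q3 - q1


# Hypothetical per-position read depths for two sequencing runs.
uniform_run = [28, 30, 31, 29, 30, 32, 30, 29]
uneven_run = [5, 60, 12, 48, 3, 55, 30, 20]

uniform_iqr = coverage_iqr(uniform_run)
uneven_iqr = coverage_iqr(uneven_run)
```

The uniform run gives a much smaller IQR than the uneven run, which matches the left and right scenarios described on the slide.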
  37. A little bit more information on coverage. Mean (Mapped) Read Depth: the mean mapped read depth is the sum of the mapped read depths at each reference base position, divided by the number of known bases in the reference. The mean read depth metric indicates how many reads, on average, are likely to be aligned at a given reference base position. Raw Read Depth: this is the total amount of sequence data produced by the instrument (so this is pre-alignment), divided by the reference genome size. Although raw read depth is often provided by sequencing instrument vendors as a specification, it does not take into account the efficiency of the alignment process. If a large fraction of the raw sequencing reads are discarded during the alignment process, the post-alignment mapped read depth can be significantly smaller than the raw read depth.
  38. Finally, the last QC metric we're going to discuss in more detail is insert size. This image is a great basic picture of the insert size compared to the adapter regions on either end of the insert. When you look at the total fragment for NGS platforms, the total fragment length includes the read 1 adapter, the insert, and the read 2 adapter. Within the insert, you have the inner distance, which is the distance between read 1 and read 2.
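The two depth definitions can be written out as simple arithmetic. Below is a minimal Python sketch. The function names and the numbers are assumptions chosen for illustration (a genome of about 3 million bases, roughly the size quoted earlier for Listeria, and 250-base reads).

```python
def mean_mapped_depth(depth_per_position):
    """Sum of mapped read depths divided by the number of reference positions."""
    return sum(depth_per_position) / len(depth_per_position)


def raw_read_depth(number_of_reads, read_length, genome_size):
    """Total bases produced by the instrument divided by the genome size."""
    return number_of_reads * read_length / genome_size


# Hypothetical run: 600,000 reads of 250 bases against a 3,000,000 base genome.
raw_depth = raw_read_depth(600_000, 250, 3_000_000)

# Hypothetical mapped depths at four reference positions after alignment.
mapped_depth = mean_mapped_depth([10, 12, 8, 10])
```

Comparing the two numbers for the same run shows how much data was lost in alignment: the larger the gap, the more raw reads were discarded.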
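The relationship between these lengths is simple subtraction and addition, sketched below in Python. Terminology varies between tools, so this follows the definitions used on this slide: the insert sits between the two adapters, and the inner distance is what is left of the insert after read 1 and read 2. The function name and all the lengths are hypothetical.

```python
def fragment_layout(insert_size, read1_length, read2_length,
                    adapter1_length, adapter2_length):
    """Derive inner distance and total fragment length from the parts."""
    # Part of the insert not covered by either read; negative means the reads overlap.
    inner_distance = insert_size - read1_length - read2_length
    # The whole fragment is the insert plus an adapter on each end.
    total_fragment = adapter1_length + insert_size + adapter2_length
    return {"inner_distance": inner_distance, "total_fragment": total_fragment}


# Hypothetical library: 500-base insert, paired 150-base reads, 60-base adapters.
layout = fragment_layout(500, 150, 150, 60, 60)
```

A negative inner distance is not an error in this sketch; it simply means the insert is shorter than the two reads combined, so the reads overlap in the middle.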
  39. Now, I mentioned the term SNP (single nucleotide polymorphism) earlier. A SNP is a position where there is a single nucleotide difference in the sequence. Sometimes there are insertions or deletions, aka indels, which may look like SNP differences. For the most part, SNPs result in phylogenetically informative differences, whereas indels may not. For analyses that look at high quality SNPs, indels are not taken into consideration.
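To show what counting SNPs while ignoring indels can look like, here is a minimal Python sketch that compares two already-aligned sequences. It assumes gaps are written as "-" and that the alignment has already been done; the sequences and the function name `count_snps` are invented. Real high quality SNP pipelines also filter on base quality and coverage, which this sketch leaves out.

```python
def count_snps(aligned_a, aligned_b):
    """Count single-base differences between two aligned sequences.

    Positions where either sequence has a gap ("-") are treated as indels
    and skipped, so they do not count as SNPs.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(
        1
        for base_a, base_b in zip(aligned_a, aligned_b)
        if base_a != base_b and base_a != "-" and base_b != "-"
    )


# Hypothetical aligned sequences.
snps_without_gap = count_snps("ACGTACGT", "ACGTTCGA")  # two substitutions
snps_with_gap = count_snps("ACG-ACGT", "ACGTACGA")     # one gap, one substitution
```

In the second comparison the gapped position is skipped, so only the final substitution is counted as a SNP.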
  40. Whole genome MLST is another method for analysis. There are predefined loci (genes), which you can think of as short reads. For example, with Listeria there are X number of genes. When looking at just one of these loci, you can take a look at the actual read and look for differences. In this example, the first read is allele 1. When you look at the next read, there are 2 SNP differences. We would call that a different allele, and number it sequentially as allele 2. It is not named allele 2 because it has 2 SNPs, but because it is the next one where we saw a difference. We look at the next read, see that it has 1 indel and 1 SNP, and call it allele 3. Notice that when it comes to MLST, we are including indels as a difference from the original read.
  41. Now we start to line up those short reads (or loci, or genes). We look at each gene and compare it against the same area across isolates. So for this example, when we are looking at locus 1, isolates a, b, and c are all the same. When we look at locus 2, isolates a and b are the same, but isolate c is not. We take a look to see whether the sequence is the same across isolates or different. If it is different, it is given a different allele number. In this sense it doesn't matter whether there are 1 or 20 SNP differences within the gene (i.e., at a given locus); an isolate would just be counted as different at that locus.
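The sequential allele numbering can be sketched in a few lines of Python. This is only an illustration of the naming rule described here: the sequences and the function name `assign_alleles` are hypothetical, and real MLST schemes look alleles up in a curated database rather than numbering them on the fly.

```python
def assign_alleles(locus_sequences):
    """Give each distinct sequence at one locus the next unused allele number."""
    allele_numbers = {}
    calls = []
    for sequence in locus_sequences:
        if sequence not in allele_numbers:
            # Any difference at all (SNP or indel) makes a new allele.
            allele_numbers[sequence] = len(allele_numbers) + 1
        calls.append(allele_numbers[sequence])
    return calls


# Hypothetical sequences for one locus from four isolates:
# the second has 2 SNPs, the third matches the first, the fourth has a deletion and a SNP.
locus_sequences = ["ACGTAC", "ACCTAA", "ACGTAC", "ACTAT"]
allele_calls = assign_alleles(locus_sequences)
```

The second sequence becomes allele 2 because it is the second distinct sequence seen, not because it carries two SNPs, and the sequence with the indel becomes allele 3.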
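Comparing isolates locus by locus then comes down to counting the loci where the allele numbers differ. Here is a minimal Python sketch; the locus names, allele numbers, and the function name `allele_differences` are made up for this example, and it assumes every isolate has a call at every locus.

```python
def allele_differences(profile_a, profile_b):
    """Count loci where two isolates carry different allele numbers."""
    return sum(
        1 for locus in profile_a if profile_a[locus] != profile_b[locus]
    )


# Hypothetical allele profiles for three isolates across three loci.
isolate_a = {"locus1": 1, "locus2": 1, "locus3": 4}
isolate_b = {"locus1": 1, "locus2": 1, "locus3": 4}
isolate_c = {"locus1": 1, "locus2": 2, "locus3": 4}

differences_a_b = allele_differences(isolate_a, isolate_b)
differences_a_c = allele_differences(isolate_a, isolate_c)
```

Isolate c differs from isolate a at exactly one locus, and that count would be the same whether the underlying gene differed by 1 SNP or by 20.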
  42. The benefit is that reads are put together, but they don't need to be put in order, so analysis can be quicker. The con is that this type of analysis may not be able to account for mobile elements, which might be important for AR, etc. At the end of the day, it is a question of how much horizontal gene transfer matters when analyzing for enteric surveillance and cluster investigation, and the answer is: little.
  43. In summary: the generation of sequence data that will be used by epidemiologists involves both hardware technology and fairly complex bioinformatics analyses. Epidemiologists will not have to know the details of the genome sequence assembly and SNP detection process, but they should know that the specifics of the analysis affect conclusions, and they should be prepared to ask questions about sequence data. SNP and MLST analysis both have a place in analysis. The two are not mutually exclusive: you do both analyses with the same raw data. For some cases MLST works better; for others SNP works better. In important cases it makes sense to use both approaches for data analysis. Just one final word: learning about WGS has been an evolution for me. It has been a process of being exposed to a different language, learning the terminology, and learning how to use it in the right context. It has taken years of hearing the same thing, putting it into practice, and then re-listening to get a more refined understanding. All of you listening today may be at different stages on this same path, and I encourage you to come back not only to this presentation but to the others as you get a more stable footing of understanding.
  44. I would like to thank a number of partners
  45. Questions?