SlideShare a Scribd company logo
1 of 35
Download to read offline
Sebastian Schmeier
s.schmeier@massey.ac.nz
http://sschmeier.com/bioinf-workshop/
Next-generation sequencing and quality control:
An introduction
2016
Sebastian Schmeier
Overview
• Typical workflow of a genomics experiment
• Genome versus transcriptome
• DNA sequencing
• The FASTQ-file format
• Sources of sequencing errors
• Quality assessment of a sequencing run
• Pre-processing sequencing data
2
DNA sequencing technologies
Quality assessment of a sequencing run
Pre-processing sequencing data
Sebastian Schmeier
Learning outcomes
• Being able to describe what next generation sequencing is.
• Being able to describe the FastQ-sequence format and understand
what information it contains and how it can be used.
• Describe what the Phred-based quality score is.
• Be able to describe the sources of sequencing errors.
• Being able to compute, investigate and evaluate the quality of 

sequence data from a sequencing experiment.
• Be able to distinguish between a good and a bad sequencing run.
• Being able to describe the steps involved in cleaning sequencing data.
3
Sebastian Schmeier
Typical workflow of a genomics experiment
4
Scientific question
Experimental design
Run the experiment
Assess data quality
Analyse the data
This is where the biological
experiment happens
This is the bottleneck of the
whole experiment
The essential part to make all
downstream analysis work
This is the bottleneck of the 

whole experiment
This is where the biological 

experiment happens
The essential part to make all 

downstream analysis work
Meaningful results
Sebastian Schmeier
Genome versus transcriptome
Genome
The entirety of an organism's
ancestral information. It is encoded
either in DNA or, for many types of
viruses, in RNA.
Transcriptome
The set of all RNA molecules,
including messenger RNA, ribosomal
RNA, transfer RNA, and other non-
coding RNA produced in one or a
population of cells
5
Name	 Base	Pairs	
HIV	 9,749	 9.7kb	
E.Coli	 4,600,000	 4.6MB	
Yeast	 12,100,000	 12.1Mb	
Drosophila	 130,000,000	 130MB	
Homo	sapiens	 3,200,000,000	 3.2GB	
marbled	lungfish	 130,000,000,000	 130Gb	
"Amoeba"	dubia	 670,000,000,000	 670Gb
Sebastian Schmeier
DNA sequencing
DNA sequencing is the process 

of determining the nucleotide 

order of a given DNA fragment.
First-generation sequencing:
1977 Sanger sequencing method 

development (chain-termination 

method)
2001, Sanger method produced a 

draft sequence of the human genome
Next-generation sequencing (NGS)
Demand for low-cost sequencing has driven the development of high-
throughput sequencing (or NGS) technologies that parallelize the sequencing
process, producing thousands or millions of sequences concurrently
2004 454 Life Sciences marketed a parallelized version of pyrosequencing
6
Sebastian Schmeier
Result of a sequencing run
Short read sequences
• The result of NGS technology are a collection of short
nucleotide sequences (reads) of varying length
(~40-400nt) depending on the technology used to
generate the reads
• Usually a reads quality is good at the beginning of the read
and errors accumulate the longer the read gets 

=> IMPORTANT
7
Sebastian Schmeier
Illumina sequencing
MiSeq:
• Bench-top sequencer
• Produces around 30 million reads/run
• Reads are up to 250nt
HiSeq:
• Large-scale sequencer
• 4 billion reads/run
• Reads up to 150nt
The Illumina systems accumulate errors
towards the end of the read sequence.
8
http://www.illumina.com/
Sebastian Schmeier
Illumina sequencing (2)
• An Illumina flowcell is a surface to
which seq. adaptors are
covalently attached.
• DNA with complementary
adaptors is attached, clonally
amplified, and then sequenced
by synthesis
• Each flowcell is subdivided into

hundreds of tiles
9
Sebastian Schmeier
Illumina sequencing (3)
10Addressing challenges in the production and analysis of illumina sequencing data. Kirchner et al. BMC Genomics, 2011,12:382
Sebastian Schmeier
The FASTQ-file format
The file-format that you will encounter soon is called FASTQ
11
@EAS139:136:FC706VJ:2:2104:15343:197393	1:Y:18:ATCACG	
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC	
+		
‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>
Sebastian Schmeier
The FASTQ-file format (2)
The file-format that you will encounter soon is called FASTQ
12
@EAS139:136:FC706VJ:2:2104:15343:197393	1:Y:18:ATCACG	
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC	
+		
‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>
Sequence ID
Sebastian Schmeier
The FASTQ-file format (3)
The file-format that you will encounter soon is called FASTQ
13
@EAS139:136:FC706VJ:2:2104:15343:197393	1:Y:18:ATCACG	
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC	
+		
‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>
Sequence ID
Sequence
Sebastian Schmeier
The FASTQ-file format (4)
The file-format that you will encounter soon is called FASTQ
14
@EAS139:136:FC706VJ:2:2104:15343:197393	1:Y:18:ATCACG	
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC	
+		
‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>
Sequence ID
Sequence
Phred quality of the 

corresponding 

nucleotide (ASCII code)
Sebastian Schmeier
The FASTQ-file format (5)
The file-format that you will encounter soon is called FASTQ
15
@EAS139:136:FC706VJ:2:2104:15343:197393	1:Y:18:ATCACG	
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC	
+		
‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>
Sequence ID
Casava 1.8 the format
Sebastian Schmeier
Sequencing errors
• Sequencing errors or mis-called bases occur when a
sequencing method calls one or more bases incorrectly leading
to an incorrect read.
• The chance of a sequencing error is generally known and
quantifiable, thanks to extensive testing and calibration of the
sequencing machines
• Each base in a read is assigned a quality score, indicating
confidence that the base has been called correctly.
16https://www.broadinstitute.org/crd/wiki/index.php/Sequencing_error
Sebastian Schmeier
The FASTQ-file format (6)
• FastQ: Phred base quality
• One ASCII character per nucleotide.
• Encodes for a quality Q = -10*log10(P), where P is the error probability
17
@EAS139:136:FC706VJ:2:2104:15343:197393	1:Y:18:ATCACG	
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC	
+		
‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>
Phred quality of the 

corresponding 

nucleotide (ASCII code)
-10*log10(0.1) =
Sebastian Schmeier
Sources of sequencing errors
• The importance and the relative effect of each error source on
downstream applications depend on many factors, such as:
sample acquisition
reagents
tissue type
protocol
18The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
instrumentation
experimental conditions
analytical application
the ultimate goal of the study.
Sebastian Schmeier
Sources of sequencing errors (2)
Sequencing errors can stem from any time point throughout the experimental
workflow, including initial sequence preparation, library preparation and sequencing.
19The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Sebastian Schmeier
Sources of sequencing errors (3)
20The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Sample preparation
• User errors; for example, mislabelling
• Degradation of DNA and/or RNA from
preservation methods; for example, tissue
autolysis, nucleic acid degradation and
crosslinking during the preparation of
formalin-fixed, paraffin-embedded (FFPE)
tissues
• Alien sequence contamination; for example,
those of mycoplasma and xenograft hosts
• Low DNA input
Sebastian Schmeier
Sources of sequencing errors (4)
21The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Library preparation
• User errors; for example, carry-over of DNA from one sample to
the next and contamination from previous reactions
• PCR amplification errors
• Primer biases; for example, binding bias, methylation bias, biases that
result from mispriming, nonspecific binding and the formation of
primer dimers, hairpins and interfering pairs, and biases that are
introduced by having a melting temperature that is too high or too
low
• 3ʹ-end capture bias that is introduced during poly(A) enrichment in
high-throughput RNA sequencing
• Private mutations; for example, those introduced by repeat regions
and mispriming over private variation
• Machine failure; for example, incorrect PCR cycling temperatures
• Chimeric reads
• Barcode and/or adaptor errors; for example, adaptor contamination,
lack of barcode diversity and incompatible barcodes
Sebastian Schmeier
Sources of sequencing errors (5)
22The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Sequencing and imaging
• User errors; for example, cluster crosstalk caused by
overloading the flow cell
• Dephasing; for example, incomplete extension and
addition of multiple nucleotides instead of a single
nucleotide
• ‘Dead’ fluorophores, damaged nucleotides and
overlapping signals
• Sequence context; for example, GC richness,
homologous and low-complexity regions, and
homopolymers
• Machine failure; for example, failure of laser, hard
drive, software and fluidics
• Strand biases
Sebastian Schmeier
Sources of sequencing errors (6)
It is important to remember that one get somatic (acquired) mutations
as well.These are not sequencing error but can be mistaken for them.
23The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
Sebastian Schmeier
Assessing quality: Reads
• If the quality of the reads is bad we can trim the nucleotides that are bad
of the end of the reads
• Not trimming the end has a huge influence on downstream processes,
e.g. assemblies
24
Good	
Bad	
Trimming	needed	
Position in read Position in read
Quality
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Sebastian Schmeier 25
Assessing quality: Reads (2)
http://solexaqa.sourceforge.net/
Sebastian Schmeier
Assessing quality:Tiles
• One can assess also the quality of a run based on the tiles of a
lane of a flowcell
spot problems with a particular tile on a lane, e.g. Bubbles in
the reagents
• The homogeneity of the Illumina process ensures that the relative
frequencies are similar from tile to tile and distributed uniformly
across each tile
when the machine is functioning properly
• Major discrepancies in these conditions can be discerned by sight
• Many such discrepancies are small and their effects are limited to
one, or a few, tiles.
26
Sebastian Schmeier
Assessing quality:Tiles (2)
• Encoded in the FASTQ-file is the flowcell tile from which each read came.
• The graph allows you to look at the quality scores from each tile across all
of your bases to see if there was a loss in quality associated with only one
part of the flowcell.
27
Good run Bad run
http://solexaqa.sourceforge.net/
Sebastian Schmeier
Assessing quality:Tiles (3)
• The plot shows the deviation from the average quality for each tile.
• The colours are on a cold to hot scale
• Cold colours being positions where the quality was at or below the average for that
base in the run
• Hotter colours indicate that a tile had worse qualities than other tiles for that base.
28
Good run Bad run
http://solexaqa.sourceforge.net/
Sebastian Schmeier
Assessing quality:Tiles (4)
29
Good run
Bad run
http://solexaqa.sourceforge.net/
Sebastian Schmeier
Assessing quality: Data processing
• Adapter trimming: If not already done, we can remove the
adapter used for sequencing.
30
DNA sequence of interest
Universal adapter
Indexed adapter
6 base index region
Example for IlluminaTrueSeq
Sebastian Schmeier
Assessing quality: Data processing (2)
• Filtering:We can remove all reads that do not have a
particular quality over the read length, e.g. at least q20 for 80%
of the read
31
good quality
bad quality
Too many errors? Discard
X
Sebastian Schmeier
Assessing quality: Data processing (3)
• Cropping:We can try to remove all nt from both ends that do
not fulfil a certain quality
32
good quality
bad quality
Sebastian Schmeier
Assessing quality: Data processing (4)
• Removal: We can remove reads that are too short after
cropping.
33
good quality
bad quality
Fragment too short? Discard
X
Sebastian Schmeier
Assessing quality: Data processing (5)
• Adapter trimming: If not already done, we can remove the adapter used
for sequencing
• Filtering:We can remove all reads that do not have a particular quality over
the read length, e.g. at least q20 for 80% of the read
• Cropping:We can try to remove all nt from both ends that do not fulfil a
certain quality
• Removal: We can remove reads that are too short after cropping.
In the end, we work with an adjusted set of sequencing reads for which we
are more certain that they represent correct nt sequences from the genome
=> However filtering/trimming does not always improve things 

as we loose information
34
Sebastian Schmeier
s.schmeier@massey.ac.nz
http://sschmeier.com/bioinf-workshop/
References
The role of replicates for error mitigation in next-generation sequencing. Robasky et al.
Nature Reviews Genetics, 2014, 15, 56-62
Addressing challenges in the production and analysis of illumina sequencing data. Kirchner
et al. BMC Genomics, 2011,12:382

More Related Content

What's hot

Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Vijay Hemmadi
 
Open Reading Frames
Open Reading FramesOpen Reading Frames
Open Reading FramesOsama Zahid
 
Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisdrelamuruganvet
 
Whole genome sequence
Whole genome sequenceWhole genome sequence
Whole genome sequencesababibi
 
Web based servers and softwares for genome analysis
Web based servers and softwares for genome analysisWeb based servers and softwares for genome analysis
Web based servers and softwares for genome analysisDr. Naveen Gaurav srivastava
 
sequence of file formats in bioinformatics
sequence of file formats in bioinformaticssequence of file formats in bioinformatics
sequence of file formats in bioinformaticsnadeem akhter
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing priyanka raviraj
 
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...VHIR Vall d’Hebron Institut de Recerca
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data Surya Saha
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignmentAfra Fathima
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...VHIR Vall d’Hebron Institut de Recerca
 

What's hot (20)

Sequence Assembly
Sequence AssemblySequence Assembly
Sequence Assembly
 
BLAST
BLASTBLAST
BLAST
 
Variant analysis and whole exome sequencing
Variant analysis and whole exome sequencingVariant analysis and whole exome sequencing
Variant analysis and whole exome sequencing
 
Genome assembly
Genome assemblyGenome assembly
Genome assembly
 
Nanopore sequencing .
Nanopore sequencing .Nanopore sequencing .
Nanopore sequencing .
 
Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins
 
Open Reading Frames
Open Reading FramesOpen Reading Frames
Open Reading Frames
 
Dot matrix
Dot matrixDot matrix
Dot matrix
 
Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysis
 
Whole genome sequence
Whole genome sequenceWhole genome sequence
Whole genome sequence
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
 
Web based servers and softwares for genome analysis
Web based servers and softwares for genome analysisWeb based servers and softwares for genome analysis
Web based servers and softwares for genome analysis
 
sequence of file formats in bioinformatics
sequence of file formats in bioinformaticssequence of file formats in bioinformatics
sequence of file formats in bioinformatics
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
 
Third Generation Sequencing
Third Generation Sequencing Third Generation Sequencing
Third Generation Sequencing
 
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
Introduction to Galaxy (UEB-UAT Bioinformatics Course - Session 2.2 - VHIR, B...
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data
 
Multiple sequence alignment
Multiple sequence alignmentMultiple sequence alignment
Multiple sequence alignment
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 

Viewers also liked

Next-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologiesNext-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologiesJan Aerts
 
New Generation Sequencing Technologies: an overview
New Generation Sequencing Technologies: an overviewNew Generation Sequencing Technologies: an overview
New Generation Sequencing Technologies: an overviewPaolo Dametto
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencingDayananda Salam
 
Quality Control of Sequencing Data
Quality Control of Sequencing DataQuality Control of Sequencing Data
Quality Control of Sequencing DataSurya Saha
 
Quality Control of NGS Data Solutions
Quality Control of NGS Data  SolutionsQuality Control of NGS Data  Solutions
Quality Control of NGS Data SolutionsSurya Saha
 
GTC group 8 - Next Generation Sequencing
GTC group 8 - Next Generation SequencingGTC group 8 - Next Generation Sequencing
GTC group 8 - Next Generation SequencingYanqi Chan
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingStephen Turner
 
Genome assembly: An Introduction (2016)
Genome assembly: An Introduction (2016)Genome assembly: An Introduction (2016)
Genome assembly: An Introduction (2016)Sebastian Schmeier
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialThomas Keane
 
Making your science powerful : an introduction to NGS experimental design
Making your science powerful : an introduction to NGS experimental designMaking your science powerful : an introduction to NGS experimental design
Making your science powerful : an introduction to NGS experimental designjelena121
 
Next Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesNext Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesChung-Tsai Su
 
NGS technologies - platforms and applications
NGS technologies - platforms and applicationsNGS technologies - platforms and applications
NGS technologies - platforms and applicationsAGRF_Ltd
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.mkim8
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsAnnelies Haegeman
 

Viewers also liked (20)

ChIP-seq - Data processing
ChIP-seq - Data processingChIP-seq - Data processing
ChIP-seq - Data processing
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 
Next-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologiesNext-generation sequencing course, part 1: technologies
Next-generation sequencing course, part 1: technologies
 
New Generation Sequencing Technologies: an overview
New Generation Sequencing Technologies: an overviewNew Generation Sequencing Technologies: an overview
New Generation Sequencing Technologies: an overview
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
Ngs ppt
Ngs pptNgs ppt
Ngs ppt
 
Quality Control of Sequencing Data
Quality Control of Sequencing DataQuality Control of Sequencing Data
Quality Control of Sequencing Data
 
Quality Control of NGS Data Solutions
Quality Control of NGS Data  SolutionsQuality Control of NGS Data  Solutions
Quality Control of NGS Data Solutions
 
Promises and Challenges of Next Generation Sequencing for HIV and HCV
Promises and Challenges of Next Generation Sequencing for HIV and HCVPromises and Challenges of Next Generation Sequencing for HIV and HCV
Promises and Challenges of Next Generation Sequencing for HIV and HCV
 
GTC group 8 - Next Generation Sequencing
GTC group 8 - Next Generation SequencingGTC group 8 - Next Generation Sequencing
GTC group 8 - Next Generation Sequencing
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencing
 
Genome assembly: An Introduction (2016)
Genome assembly: An Introduction (2016)Genome assembly: An Introduction (2016)
Genome assembly: An Introduction (2016)
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing Tutorial
 
Making your science powerful : an introduction to NGS experimental design
Making your science powerful : an introduction to NGS experimental designMaking your science powerful : an introduction to NGS experimental design
Making your science powerful : an introduction to NGS experimental design
 
Next Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesNext Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and Opportunities
 
NGS technologies - platforms and applications
NGS technologies - platforms and applicationsNGS technologies - platforms and applications
NGS technologies - platforms and applications
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.
 
Ngs intro_v6_public
 Ngs intro_v6_public Ngs intro_v6_public
Ngs intro_v6_public
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platforms
 

Similar to Next-generation sequencing and quality control: An Introduction (2016)

Generations of sequencing technologies.
Generations of sequencing technologies. Generations of sequencing technologies.
Generations of sequencing technologies. ShadenAlharbi
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposalGenomeInABottle
 
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...Thermo Fisher Scientific
 
01-Sequencing_Technologies (1).ppt for education
01-Sequencing_Technologies (1).ppt for education01-Sequencing_Technologies (1).ppt for education
01-Sequencing_Technologies (1).ppt for educationaryajayakottarathil
 
DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification Senthil Natesan
 
CRISPR Screening: the What, Why and How
CRISPR Screening: the What, Why and HowCRISPR Screening: the What, Why and How
CRISPR Screening: the What, Why and HowHorizonDiscovery
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2Dan Gaston
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...QBiC_Tue
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128GenomeInABottle
 
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...Mark Evans
 
20100509 bioinformatics kapushesky_lecture03-04_0
20100509 bioinformatics kapushesky_lecture03-04_020100509 bioinformatics kapushesky_lecture03-04_0
20100509 bioinformatics kapushesky_lecture03-04_0Computer Science Club
 
Aug2014 abrf interlaboratory study plans
Aug2014 abrf interlaboratory study plansAug2014 abrf interlaboratory study plans
Aug2014 abrf interlaboratory study plansGenomeInABottle
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingmikaelhuss
 
Whole exome sequencing data analysis.pptx
Whole exome sequencing data analysis.pptxWhole exome sequencing data analysis.pptx
Whole exome sequencing data analysis.pptxHaibo Liu
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...GenomeInABottle
 
Transcriptome Analysis & Applications
Transcriptome Analysis & ApplicationsTranscriptome Analysis & Applications
Transcriptome Analysis & Applications1010Genome Pte Ltd
 

Similar to Next-generation sequencing and quality control: An Introduction (2016) (20)

Generations of sequencing technologies.
Generations of sequencing technologies. Generations of sequencing technologies.
Generations of sequencing technologies.
 
140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal140127 abrf interlaboratory study proposal
140127 abrf interlaboratory study proposal
 
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
Massively Parallel Sequencing - integrating the Ion PGM™ sequencer into your ...
 
Seq 301116
Seq 301116Seq 301116
Seq 301116
 
01-Sequencing_Technologies (1).ppt for education
01-Sequencing_Technologies (1).ppt for education01-Sequencing_Technologies (1).ppt for education
01-Sequencing_Technologies (1).ppt for education
 
DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification
 
CRISPR Screening: the What, Why and How
CRISPR Screening: the What, Why and HowCRISPR Screening: the What, Why and How
CRISPR Screening: the What, Why and How
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2
 
Ngs introduction
Ngs introductionNgs introduction
Ngs introduction
 
BioSB meeting 2015
BioSB meeting 2015BioSB meeting 2015
BioSB meeting 2015
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128
 
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
XabTracker & SeqAgent: Integrated LIMS & Sequence Analysis Tools for Antibody...
 
20100509 bioinformatics kapushesky_lecture03-04_0
20100509 bioinformatics kapushesky_lecture03-04_020100509 bioinformatics kapushesky_lecture03-04_0
20100509 bioinformatics kapushesky_lecture03-04_0
 
Aug2014 abrf interlaboratory study plans
Aug2014 abrf interlaboratory study plansAug2014 abrf interlaboratory study plans
Aug2014 abrf interlaboratory study plans
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
Whole exome sequencing data analysis.pptx
Whole exome sequencing data analysis.pptxWhole exome sequencing data analysis.pptx
Whole exome sequencing data analysis.pptx
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
 
Transcriptome Analysis & Applications
Transcriptome Analysis & ApplicationsTranscriptome Analysis & Applications
Transcriptome Analysis & Applications
 
132 gene expression in atherosclerotic plaques
132 gene expression in atherosclerotic plaques132 gene expression in atherosclerotic plaques
132 gene expression in atherosclerotic plaques
 

Recently uploaded

Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsMebane Rash
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxDenish Jangid
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...pradhanghanshyam7136
 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxdhanalakshmis0310
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSCeline George
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701bronxfugly43
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 

Recently uploaded (20)

Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Magic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptxMagic bus Group work1and 2 (Team 3).pptx
Magic bus Group work1and 2 (Team 3).pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 

Next-generation sequencing and quality control: An Introduction (2016)

  • 2. Sebastian Schmeier Overview • Typical workflow of a genomics experiment • Genome versus transcriptome • DNA sequencing • The FASTQ-file format • Sources of sequencing errors • Quality assessment of a sequencing run • Pre-processing sequencing data 2 DNA sequencing technologies Quality assessment of a sequencing run Pre-processing sequencing data
  • 3. Sebastian Schmeier Learning outcomes • Being able to describe what next generation sequencing is. • Being able to describe the FastQ-sequence format and understand what information it contains and how it can be used. • Describe what the Phred-based quality score is. • Be able to describe the sources of sequencing errors. • Being able to compute, investigate and evaluate the quality of 
 sequence data from a sequencing experiment. • Be able to distinguish between a good and a bad sequencing run. • Being able to describe the steps involved in cleaning sequencing data. 3
  • 4. Sebastian Schmeier Typical workflow of a genomics experiment 4 Scientific question Experimental design Run the experiment Assess data quality Analyse the data This is where the biological experiment happens This is the bottleneck of the whole experiment The essential part to make all downstream analysis work This is the bottleneck of the 
 whole experiment This is where the biological 
 experiment happens The essential part to make all 
 downstream analysis work Meaningful results
  • 5. Sebastian Schmeier Genome versus transcriptome Genome The entirety of an organism's ancestral information. It is encoded either in DNA or, for many types of viruses, in RNA. Transcriptome The set of all RNA molecules, including messenger RNA, ribosomal RNA, transfer RNA, and other non- coding RNA produced in one or a population of cells 5 Name Base Pairs HIV 9,749 9.7kb E.Coli 4,600,000 4.6MB Yeast 12,100,000 12.1Mb Drosophila 130,000,000 130MB Homo sapiens 3,200,000,000 3.2GB marbled lungfish 130,000,000,000 130Gb "Amoeba" dubia 670,000,000,000 670Gb
  • 6. Sebastian Schmeier DNA sequencing DNA sequencing is the process 
 of determining the nucleotide 
 order of a given DNA fragment. First-generation sequencing: 1977 Sanger sequencing method 
 development (chain-termination 
 method) 2001, Sanger method produced a 
 draft sequence of the human genome Next-generation sequencing (NGS) Demand for low-cost sequencing has driven the development of high- throughput sequencing (or NGS) technologies that parallelize the sequencing process, producing thousands or millions of sequences concurrently 2004 454 Life Sciences marketed a parallelized version of pyrosequencing 6
  • 7. Sebastian Schmeier Result of a sequencing run Short read sequences • The result of NGS technology are a collection of short nucleotide sequences (reads) of varying length (~40-400nt) depending on the technology used to generate the reads • Usually a reads quality is good at the beginning of the read and errors accumulate the longer the read gets 
 => IMPORTANT 7
  • 8. Sebastian Schmeier Illumina sequencing MiSeq: • Bench-top sequencer • Produces around 30 million reads/run • Reads are up to 250nt HiSeq: • Large-scale sequencer • 4 billion reads/run • Reads up to 150nt The Illumina systems accumulate errors towards the end of the read sequence. 8 http://www.illumina.com/
  • 9. Sebastian Schmeier Illumina sequencing (2) • An Illumina flowcell is a surface to which seq. adaptors are covalently attached. • DNA with complementary adaptors is attached, clonally amplified, and then sequenced by synthesis • Each flowcell is subdivided into
 hundreds of tiles 9
  • 10. Sebastian Schmeier Illumina sequencing (3) 10Addressing challenges in the production and analysis of illumina sequencing data. Kirchner et al. BMC Genomics, 2011,12:382
  • 11. Sebastian Schmeier The FASTQ-file format The file-format that you will encounter soon is called FASTQ 11 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC + ‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>
  • 12. Sebastian Schmeier The FASTQ-file format (2) The file-format that you will encounter soon is called FASTQ 12 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC + ‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>> Sequence ID
  • 13. Sebastian Schmeier The FASTQ-file format (3) The file-format that you will encounter soon is called FASTQ 13 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC + ‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>> Sequence ID Sequence
  • 14. Sebastian Schmeier The FASTQ-file format (4) The file-format that you will encounter soon is called FASTQ 14 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC + ‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>> Sequence ID Sequence Phred quality of the 
 corresponding 
 nucleotide (ASCII code)
  • 15. Sebastian Schmeier The FASTQ-file format (5) The file-format that you will encounter soon is called FASTQ 15 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC + ‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>> Sequence ID Casava 1.8 the format
  • 16. Sebastian Schmeier Sequencing errors • Sequencing errors or mis-called bases occur when a sequencing method calls one or more bases incorrectly leading to an incorrect read. • The chance of a sequencing error is generally known and quantifiable, thanks to extensive testing and calibration of the sequencing machines • Each base in a read is assigned a quality score, indicating confidence that the base has been called correctly. 16https://www.broadinstitute.org/crd/wiki/index.php/Sequencing_error
  • 17. Sebastian Schmeier The FASTQ-file format (6) • FastQ: Phred base quality • One ASCII character per nucleotide. • Encodes for a quality Q = -10*log10(P), where P is the error probability 17 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC + ‘'*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>> Phred quality of the 
 corresponding 
 nucleotide (ASCII code) -10*log10(0.1) =
  • 18. Sebastian Schmeier Sources of sequencing errors • The importance and the relative effect of each error source on downstream applications depend on many factors, such as: sample acquisition reagents tissue type protocol 18The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62 instrumentation experimental conditions analytical application the ultimate goal of the study.
  • 19. Sebastian Schmeier Sources of sequencing errors (2) Sequencing errors can stem from any time point throughout the experimental workflow, including initial sequence preparation, library preparation and sequencing. 19The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
  • 20. Sebastian Schmeier Sources of sequencing errors (3) 20The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62 Sample preparation • User errors; for example, mislabelling • Degradation of DNA and/or RNA from preservation methods; for example, tissue autolysis, nucleic acid degradation and crosslinking during the preparation of formalin-fixed, paraffin-embedded (FFPE) tissues • Alien sequence contamination; for example, those of mycoplasma and xenograft hosts • Low DNA input
  • 21. Sebastian Schmeier Sources of sequencing errors (4) 21The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62 Library preparation • User errors; for example, carry-over of DNA from one sample to the next and contamination from previous reactions • PCR amplification errors • Primer biases; for example, binding bias, methylation bias, biases that result from mispriming, nonspecific binding and the formation of primer dimers, hairpins and interfering pairs, and biases that are introduced by having a melting temperature that is too high or too low • 3ʹ-end capture bias that is introduced during poly(A) enrichment in high-throughput RNA sequencing • Private mutations; for example, those introduced by repeat regions and mispriming over private variation • Machine failure; for example, incorrect PCR cycling temperatures • Chimeric reads • Barcode and/or adaptor errors; for example, adaptor contamination, lack of barcode diversity and incompatible barcodes
  • 22. Sebastian Schmeier Sources of sequencing errors (5) 22The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62 Sequencing and imaging • User errors; for example, cluster crosstalk caused by overloading the flow cell • Dephasing; for example, incomplete extension and addition of multiple nucleotides instead of a single nucleotide • ‘Dead’ fluorophores, damaged nucleotides and overlapping signals • Sequence context; for example, GC richness, homologous and low-complexity regions, and homopolymers • Machine failure; for example, failure of laser, hard drive, software and fluidics • Strand biases
  • 23. Sebastian Schmeier Sources of sequencing errors (6) It is important to remember that one get somatic (acquired) mutations as well.These are not sequencing error but can be mistaken for them. 23The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62
  • 24. Sebastian Schmeier Assessing quality: Reads • If the quality of the reads is bad we can trim the nucleotides that are bad of the end of the reads • Not trimming the end has a huge influence on downstream processes, e.g. assemblies 24 Good Bad Trimming needed Position in read Position in read Quality http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • 25. Sebastian Schmeier 25 Assessing quality: Reads (2) http://solexaqa.sourceforge.net/
  • 26. Sebastian Schmeier Assessing quality:Tiles • One can assess also the quality of a run based on the tiles of a lane of a flowcell spot problems with a particular tile on a lane, e.g. Bubbles in the reagents • The homogeneity of the Illumina process ensures that the relative frequencies are similar from tile to tile and distributed uniformly across each tile when the machine is functioning properly • Major discrepancies in these conditions can be discerned by sight • Many such discrepancies are small and their effects are limited to one, or a few, tiles. 26
  • 27. Sebastian Schmeier Assessing quality:Tiles (2) • Encoded in the FASTQ-file is the flowcell tile from which each read came. • The graph allows you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell. 27 Good run Bad run http://solexaqa.sourceforge.net/
  • 28. Sebastian Schmeier Assessing quality:Tiles (3) • The plot shows the deviation from the average quality for each tile. • The colours are on a cold to hot scale • Cold colours being positions where the quality was at or below the average for that base in the run • Hotter colours indicate that a tile had worse qualities than other tiles for that base. 28 Good run Bad run http://solexaqa.sourceforge.net/
  • 29. Sebastian Schmeier Assessing quality:Tiles (4) 29 Good run Bad run http://solexaqa.sourceforge.net/
  • 30. Sebastian Schmeier Assessing quality: Data processing • Adapter trimming: If not already done, we can remove the adapter used for sequencing. 30 DNA sequence of interest Universal adapter Indexed adapter 6 base index region Example for IlluminaTrueSeq
  • 31. Sebastian Schmeier Assessing quality: Data processing (2) • Filtering:We can remove all reads that do not have a particular quality over the read length, e.g. at least q20 for 80% of the read 31 good quality bad quality Too many errors? Discard X
  • 32. Sebastian Schmeier Assessing quality: Data processing (3) • Cropping:We can try to remove all nt from both ends that do not fulfil a certain quality 32 good quality bad quality
  • 33. Sebastian Schmeier Assessing quality: Data processing (4) • Removal: We can remove reads that are too short after cropping. 33 good quality bad quality Fragment too short? Discard X
  • 34. Sebastian Schmeier Assessing quality: Data processing (5) • Adapter trimming: If not already done, we can remove the adapter used for sequencing • Filtering:We can remove all reads that do not have a particular quality over the read length, e.g. at least q20 for 80% of the read • Cropping:We can try to remove all nt from both ends that do not fulfil a certain quality • Removal: We can remove reads that are too short after cropping. In the end, we work with an adjusted set of sequencing reads for which we are more certain that they represent correct nt sequences from the genome => However filtering/trimming does not always improve things 
 as we loose information 34
  • 35. Sebastian Schmeier s.schmeier@massey.ac.nz http://sschmeier.com/bioinf-workshop/ References The role of replicates for error mitigation in next-generation sequencing. Robasky et al. Nature Reviews Genetics, 2014, 15, 56-62 Addressing challenges in the production and analysis of illumina sequencing data. Kirchner et al. BMC Genomics, 2011,12:382