ERRORS & DRAWBACKS
OF NGS
Nixon Mendez
Department of Bioinformatics
Introduction
• High-throughput sequencing technologies have made whole-genome
sequencing and resequencing available to many more researchers and
projects.
• Cost and time have been greatly reduced.
• The error profiles and limitations of the new platforms differ significantly
from those of previous sequencing technologies.
• The selection of an appropriate sequencing platform for particular types of
experiments is an important consideration.
• This selection requires a detailed understanding of the available
technologies, including their sources of error, error rates, speed, and cost
of sequencing.
Errors in NGS
Discussion of NGS sequencing errors focuses mainly on the following
points:
1. Low quality bases
2. PCR errors
3. High error rate
1. Low quality bases
1. All the NGS companies have made big strides in improving the raw
accuracy of the bases.
2. Read lengths have increased as a result.
3. The number of reads has also increased to the point where coverage is
high enough to rule out most issues with low-quality base calls.
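Base-call quality is conventionally reported as a Phred score Q, where the
estimated probability that the call is wrong is P = 10^(-Q/10). The sketch
below assumes Phred+33 ASCII-encoded quality strings, as found in current
FASTQ files. The function names and the threshold of Q20 are illustrative
choices, not part of any particular platform's pipeline.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def decode_qualities(qual_string, offset=33):
    """Decode an ASCII quality string (Phred+33 assumed) into integer scores."""
    return [ord(c) - offset for c in qual_string]


def mask_low_quality(seq, qual_string, min_q=20, offset=33):
    """Replace every base whose quality is below min_q with 'N'."""
    quals = decode_qualities(qual_string, offset)
    return "".join(b if q >= min_q else "N" for b, q in zip(seq, quals))


if __name__ == "__main__":
    # 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0 under Phred+33.
    print(decode_qualities("I5!"))
    print(mask_low_quality("ACG", "I5!"))
```

Masking is only one option; with enough coverage, a consensus across reads
can outvote an isolated low-quality call at the same position.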
2. PCR errors
All of the current NGS systems use PCR in some form to amplify the
initial nucleic acid and to add adapters for sequencing.
1. The amount of amplification can be very high, with multiple rounds
of PCR for exome and/or amplicon applications.
2. As a result, some observed base differences are artefacts generated by
the PCR rather than true variants.
3. Several groups have published improved methods that reduce the
amount of PCR or use alternative enzymes to increase the fidelity of
the reaction, e.g. Quail et al.
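To see why the amount of amplification and the polymerase fidelity both
matter, the following rough sketch estimates a worst-case fraction of product
molecules that carry at least one polymerase error. It assumes every final
molecule results from `cycles` successive copying events and that errors are
independent per base. The error rates in the example are illustrative
orders of magnitude, not measured values for any specific enzyme.

```python
def fraction_with_pcr_error(error_rate, amplicon_length, cycles):
    """Worst-case fraction of molecules with at least one polymerase error.

    Assumption: each final molecule has been copied `cycles` times, and each
    base is miscopied independently with probability `error_rate` per copy.
    """
    return 1.0 - (1.0 - error_rate) ** (amplicon_length * cycles)


if __name__ == "__main__":
    # Hypothetical comparison of a lower- and a higher-fidelity enzyme
    # on a 300 bp amplicon, for two different amounts of amplification.
    for rate in (1e-4, 1e-6):
        for cycles in (10, 30):
            frac = fraction_with_pcr_error(rate, 300, cycles)
            print(f"rate={rate:g} cycles={cycles} fraction={frac:.4f}")
```

Under these assumptions, both fewer cycles and a higher-fidelity enzyme
reduce the fraction of artefact-carrying molecules, which is the direction
taken by the improved methods mentioned above.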
3. High error rate
1. High error rate prevents the accurate detection of rare mutations in
heterogeneous populations such as tumors and microbiomes.
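The reasoning can be made concrete with a binomial model. If each read
miscalls a given position as one specific alternate base with probability
`error_rate`, the number of erroneous variant-supporting reads at depth
`depth` follows a binomial distribution. The sketch below computes the chance
that errors alone produce at least `k` such reads. The independence of errors
between reads is an assumption, and the function name is hypothetical.

```python
from math import comb


def prob_at_least_k_errors(depth, error_rate, k):
    """P(X >= k) for X ~ Binomial(depth, error_rate).

    Assumption: sequencing errors are independent between reads.
    """
    p_fewer = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # A true variant at 1% frequency is expected to appear in about 10 of
    # 1000 reads. How often would errors alone reach that count?
    for rate in (0.01, 0.001):
        print(rate, prob_at_least_k_errors(1000, rate, 10))
```

When the error rate approaches the frequency of the variant, errors alone
reach the same read count often enough that the variant cannot be called
with confidence.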
Limitations of NGS
NGS has inherent limitations, as follows:
1. Sequence properties and algorithmic challenges
2. Contamination or new insertions
3. Repeat content
4. Segmental duplications
5. Missing and fragmented genes
6. Reference index
1. Sequence properties and
algorithmic challenges
• NGS technologies typically generate shorter sequences with higher
error rates than previous sequencing technologies, from relatively short
insert libraries.
• Illumina’s sequencing by synthesis routinely produces read lengths of
75–100 base pairs (bp) from libraries with insert sizes of 200–500 bp.
• The short read lengths of NGS prevent the assembly of genomes with long
stretches of repetitive DNA.
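The effect of read length on repeats can be illustrated on a toy sequence.
The sketch below counts how many error-free reads of a given length occur at
exactly one position in a made-up genome that contains two copies of an 8 bp
repeat. The sequence and function name are invented for illustration; real
genomes and real repeats are far longer.

```python
from collections import Counter


def unique_read_fraction(genome, read_length):
    """Fraction of all possible reads that occur at exactly one position."""
    reads = [
        genome[i:i + read_length]
        for i in range(len(genome) - read_length + 1)
    ]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] == 1) / len(reads)


if __name__ == "__main__":
    repeat = "ACGTTGCA"  # toy repeat, present twice
    genome = "AAAC" + repeat + "GGGT" + repeat + "TTTA"
    for length in (4, 8, 12):
        print(length, unique_read_fraction(genome, length))
```

Reads shorter than or equal to the repeat can fall entirely inside it and
cannot be placed uniquely, while reads longer than the repeat always include
flanking sequence that anchors them.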
2. Contamination or new insertions
• An important consideration in any sequencing project is DNA
contamination from other organisms.
• Before analysis, genomes are searched for possible contaminants
by comparing them against the NCBI nucleotide (nt) database.
• De novo sequence assemblies may be an important source for the
discovery of insertion polymorphisms, but such sequences require
particular scrutiny and additional validation because of their tendency
to enrich for contamination artifacts.
• Discriminating such sequences before sequence assembly becomes
particularly problematic when the underlying sequence reads are
short.
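As a toy stand-in for the database comparison described above, the sketch
below scores a contig by the fraction of its k-mers that also occur in a
known contaminant sequence, on either strand. This is not how a search
against the nt database works in practice, where an alignment tool would be
used. The function names and the default k are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def contaminant_kmer_fraction(contig, contaminant, k=8):
    """Fraction of the contig's k-mers found in the contaminant (both strands)."""
    contig_kmers = kmers(contig, k)
    if not contig_kmers:
        return 0.0
    reference = kmers(contaminant, k) | kmers(reverse_complement(contaminant), k)
    return len(contig_kmers & reference) / len(contig_kmers)


if __name__ == "__main__":
    contaminant = "ACGTTGCAGGCTAAGCTT"  # made-up sequence
    print(contaminant_kmer_fraction(contaminant, contaminant))
    print(contaminant_kmer_fraction("TTTTTTTTTTTT", contaminant))
```

A contig with a high score would be a candidate for removal or for further
validation; a short read contains few k-mers, which is one reason screening
before assembly is harder with short reads.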
3. Repeat content
• Any WGS-based sequence assembly algorithm will collapse identical
repeats, resulting in reduced or lost genomic complexity.
• Most Alu subfamilies were underrepresented because of the shorter
sequence length of the Alu repeat class.
• Most common repeat classes showed reduced representation in the YH
genome.
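Repeat collapse can be seen in miniature by comparing the total number of
k-mers in a sequence with the number of distinct k-mers. An assembler that
works from distinct k-mers sees the interior of an identical repeat only
once, however many copies the genome holds. The sequence below is a made-up
example, and the function name is hypothetical.

```python
def kmer_totals(seq, k):
    """Return (total k-mers, distinct k-mers) for a sequence."""
    all_kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return len(all_kmers), len(set(all_kmers))


if __name__ == "__main__":
    repeat = "ACGTTGCA"  # toy repeat, present twice
    genome = "AAAC" + repeat + "GGGT" + repeat + "TTTA"
    total, distinct = kmer_totals(genome, 4)
    print(f"total={total} distinct={distinct} collapsed={total - distinct}")
```

The difference between the two counts corresponds to sequence that is
present in the genome but would be represented only once after collapse.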
4. Segmental duplications
• The whole-genome assembly comparison (WGAC) method is used to analyse
segmental duplications.
• If we limit our analysis to duplications present in the human reference
genome that were also detected through read-depth analysis of a capillary
sequencing–based WGS dataset (Celera) and of YH, we conclude that 99.4% of
true pairwise segmental duplications were absent from the assembly.
• We predict that 95.6% of the duplications in the YH de novo assembly are
likely false because they did not correspond to duplications predicted by
read depth.
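Read-depth analysis rests on a simple idea: reads from every copy of a
duplicated region pile up on the single copy present in the reference, so
duplicated regions show elevated depth. The sketch below flags windows whose
depth exceeds the median by a chosen ratio. It is a simplified illustration,
not the method used to obtain the figures above; the threshold and the
function name are assumptions.

```python
from statistics import median


def flag_duplicated_windows(window_depths, min_ratio=1.5):
    """Return indices of windows whose depth is >= min_ratio times the median.

    Assumption: most windows are single-copy, so the median depth is a
    reasonable estimate of the single-copy baseline.
    """
    baseline = median(window_depths)
    if baseline == 0:
        return []
    return [i for i, d in enumerate(window_depths) if d / baseline >= min_ratio]


if __name__ == "__main__":
    # Hypothetical per-window mean depths; windows 3 and 4 are near 2x.
    depths = [30, 31, 29, 62, 60, 30, 28]
    print(flag_duplicated_windows(depths))
```

A duplication reported in an assembly but showing no excess depth in the
corresponding region would, under this reasoning, be treated as suspect.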
5. Missing and fragmented genes
• The reduction in assembled genome content affected both gene coverage and
the fragmentation of genes across multiple scaffolds.
• The presence of duplicated and repetitive sequences in introns
complicates complete gene assembly and annotation, leading to genes
being broken among multiple sequence scaffolds.
6. Reference index
• Another problem is analysing genomes for which no reference genome is
available.
• The portions that are missing or misassembled cannot be readily
inferred and are invisible to the biologist.
• Biases against duplications and repeats, as well as fragmentation,
raise questions related to the accuracy and completeness of similarly
assembled genomes.
Overcoming the Limitations
• It is the responsibility of the scientific community to enforce
standards of quality that can be measured and assessed.
• It is critical to develop new hybrid sequencing approaches, such as
multiplatform strategies including the third-generation long-read
technologies, high-quality finished long-insert clones and new
assembly algorithms that can accommodate these heterogeneous
datasets.
• The genome assemblies themselves must be experimentally validated.
• Large-molecule, high-quality sequencing should not be abandoned
until the balance between quantity and quality of genomes has been
re-established.
THANK YOU!!

Errors and Limitations of Next Generation Sequencing
