Estimation of sequencing error rates present in genome databases
Valeriya Simeonova1, Ivan Popov2, Dimitar Vassilev1*
1 - Faculty of Mathematics and Informatics, Sofia University "St. Kliment Ohridski", Sofia, Bulgaria
2 - Agro Bio Institute, Bioinformatics group, Sofia, Bulgaria
1* - corresponding author: jim6329@gmail.com
Abstract
Next-generation sequencing
Validation of sequences
Donor/Acceptor sites - GT/AG
NCBI as primary DB for information scanning
Introduction
To measure the quality of sequencing, one needs a stretch of DNA/RNA with high conservation, in which a variation is statistically very unlikely. Such sequences, found in all eukaryotes, are the donor and acceptor pairs of the splice sites.
Donor - Acceptor Pairs:
if not reverse complement then: GT or GC vs. AG
if reverse complement then: CT vs. AC or GC
Counting of reverse complement sites:
as CT is the RC of AG, it will be counted as AG
as AC is the RC of GT, it will be counted as GT
GC is its own RC, so it will be counted as GC
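The counting rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the `on_reverse_strand` flag are hypothetical and do not come from the presentation; only the dinucleotide mapping does.

```python
# Minimal sketch: map a dinucleotide read on the reverse strand back to the
# forward-strand form before counting it. Names here are illustrative only.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def counted_as(dinucleotide: str, on_reverse_strand: bool) -> str:
    """Return the dinucleotide in the orientation used for counting."""
    if on_reverse_strand:
        return reverse_complement(dinucleotide)
    return dinucleotide.upper()

for site in ("CT", "AC", "GC"):
    print(site, "is counted as", counted_as(site, on_reverse_strand=True))
```

With this normalization, a site on either strand falls into one of the two groups used later (GT/GC or AG).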
Materials and methods
The NCBI Genome entries for the Oryza sativa chromosomes were used to estimate the sequencing error in the splicing donor/acceptor sites. The classical form of the splice sites (GT/GC - AG) was used for the analysis. Only variations in this conserved sequence were considered, and any rare splice sites (AT/AC) [2] found were not taken into account.
An alternative sequence of the rice genome was obtained from the Plant Genome Database (PlantGDB, PGDB) [1]. It was used to verify the splicing errors in the NCBI sequence. The positions of the intron-exon boundaries were taken from the annotation of the NCBI Nucleotide entries of the chromosomes. The respective boundaries in the PGDB genome were selected by local pairwise alignment (BLAST) of the chromosomes of the two retrieved genomes. Fragments that were not part of the best BLAST result were ignored. We estimated the sequencing errors by calculating the frequency of sites that do not match the canonical form.
(Chart label on this slide: Error rate by chromosome)
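The frequency estimate described above could look roughly like the following sketch. It assumes that donor and acceptor sites are counted individually and that the dinucleotides have already been brought to the forward-strand orientation; the function name and input layout are hypothetical, not the authors' actual pipeline.

```python
# Minimal sketch, assuming each intron is given as a (donor, acceptor) pair of
# dinucleotides and that every donor and acceptor counts as one checked site.
CANONICAL_DONORS = {"GT", "GC"}
CANONICAL_ACCEPTORS = {"AG"}
RARE_PAIR = ("AT", "AC")  # rare splice sites, left out of the estimate

def noncanonical_frequency(introns):
    """Return (mismatches, sites_checked, frequency) for the given introns."""
    checked = 0
    mismatches = 0
    for donor, acceptor in introns:
        donor, acceptor = donor.upper(), acceptor.upper()
        if (donor, acceptor) == RARE_PAIR:
            continue  # rare AT/AC introns are not taken into account
        for site, allowed in ((donor, CANONICAL_DONORS),
                              (acceptor, CANONICAL_ACCEPTORS)):
            checked += 1
            if site not in allowed:
                mismatches += 1
    frequency = mismatches / checked if checked else 0.0
    return mismatches, checked, frequency

# Toy input: two canonical introns, one bad donor, one rare intron, one bad acceptor.
example = [("GT", "AG"), ("GC", "AG"), ("GA", "AG"), ("AT", "AC"), ("GT", "AA")]
print(noncanonical_frequency(example))  # (2, 8, 0.25)
```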
Results and Discussion
12 chromosomes and 225 981 donor-acceptor sites were checked; 3385 differences from the classical form were found.
This leads to an error rate of 1.50 x 10^-2. This is three orders of magnitude higher than the error rate estimated by Wesche et al. [3] for the reference mouse genome (whole genome shotgun sequence of the C57BL/6J line), and one order of magnitude higher than the estimated error for coding sequences in the GenBank records of mouse genes.
Chart 1
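The reported rate follows directly from the two counts on this slide, as this short check shows (the variable names are illustrative):

```python
# Worked arithmetic for the error rate, using the counts given on the slide.
sites_checked = 225_981
differences = 3_385

error_rate = differences / sites_checked
print(f"{error_rate:.4f}")  # 0.0150
print(f"{error_rate:.2e}")  # 1.50e-02
```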
Results based only on NCBI data
These results differ slightly from the previous ones, but they give us some insight into the genome. We could analyze the differences in errors between genomes and predict what error to expect when sequencing another organism classified in a certain group (plants, animal groups, etc.). The same approach could be used for examining (verifying) results from NGS.
We analyzed 12 chromosomes and discovered 3684 differences in 226 270 sites.
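To see how small the difference is, the NCBI-only counts can be set next to the counts from the previous result slide (3385 differences in 225 981 sites). The sketch below only redoes the arithmetic; the helper name is hypothetical.

```python
# Minimal sketch comparing the two reported results; counts are from the slides.
def site_error_rate(differences: int, sites_checked: int) -> float:
    """Fraction of checked sites that differ from the canonical form."""
    return differences / sites_checked

previous = site_error_rate(3_385, 225_981)   # previous result slide
ncbi_only = site_error_rate(3_684, 226_270)  # this slide, NCBI data only

print(f"previous result: {previous:.2e}")
print(f"NCBI only:       {ncbi_only:.2e}")
print(f"relative change: {100 * (ncbi_only / previous - 1):.1f}%")
```

Both rates stay in the 10^-2 range, which matches the statement that the results differ only slightly.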
Chart 2: Error by chromosome, with the genome-wide error of every site group taken as 100%
Assumption: the genome-wide error of each site group (GT/GC and AG) is taken as 100%.
The two groups do not have the same trend lines1.
At the same time, the larger the chromosome, the more errors it has.
For example, Chromosome 1 accounts for 15.08% of the genome-wide error in the GT/GC group.
1 - The trend lines are of the Moving Average type
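For readers unfamiliar with this trend line type, a trailing moving average can be sketched as below. The period of 2 and the sample values are assumptions for illustration; the slides do not state the period that was used.

```python
# Minimal sketch of a trailing moving average, the kind of smoothing a
# "Moving Average" trend line applies. The period is an assumed parameter.
def moving_average(values, period=2):
    """Average each value with the (period - 1) values before it."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return [
        sum(values[i - period + 1 : i + 1]) / period
        for i in range(period - 1, len(values))
    ]

# Hypothetical per-chromosome percentages, not taken from the charts.
print(moving_average([4, 6, 8, 2], period=2))  # [5.0, 7.0, 5.0]
```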
Chart 3: Error for each site group in each chromosome
Assumption: within each chromosome, every site group (GT/GC and AG) is taken as 100%, so the chart shows the error for each site group in each chromosome.
The two groups have similar trend lines1.
At the same time, the error level in AG is evidently higher than the error level in GT/GC in relative terms.
For example, in Chromosome 1 about 16 of every 1000 GT/GC sites are wrong.
1 - The trend lines are of the Polynomial type
Chart 4: Error in each chromosome, with each chromosome taken as 100%
Assumption: both site groups (GT/GC and AG) together give the error level of each chromosome, and each chromosome is taken as 100%.
The trend line1 of the error level and the trends in the chart show which site group produces more errors in each chromosome.
At the same time, the number of base pairs in the chromosome does not matter.
For example, in Chromosome 1 about 176 of every 10000 sites are wrong.
This chart also shows how much these results differ from the analysis verified against the PlantGDB genome. This matters when examining data that were sequenced and assembled by different methods.
1 - The trend line is of the Moving Average type
Charts 5: Error and no-error occurrence, with the whole genome taken as 100%
Assumption: the whole genome is taken as 100%. The charts show the two groups NE (errors) and EQ (no errors) for each chromosome, so their sum is 100%.
The two groups have similar trend lines1.
At the same time, the larger the chromosome, the higher the rates.
For example, the error for Chromosome 1 is 0.26% of all sites in the whole genome, including the sites without errors.
1 - The trend lines are of the Moving Average type
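Charts 2 to 5 differ mainly in what is taken as 100%. The sketch below shows one possible reading of the four normalizations. The counts are hypothetical and are not taken from the slides, and the interpretation of each chart's denominator is an assumption based on the worked examples given for Chromosome 1.

```python
# Hypothetical counts, NOT from the slides: (sites checked, sites in error)
# for each chromosome and site group.
COUNTS = {
    "chr1": {"GT/GC": (1000, 16), "AG": (1000, 20)},
    "chr2": {"GT/GC": (800, 10), "AG": (800, 14)},
}

def percent(part, whole):
    return 100.0 * part / whole

def normalizations(counts):
    """Return the four percentages per chromosome under the assumed readings."""
    groups = sorted({g for chrom in counts.values() for g in chrom})
    genome_sites = sum(s for chrom in counts.values() for s, _ in chrom.values())
    group_errors = {g: sum(chrom[g][1] for chrom in counts.values()) for g in groups}
    result = {}
    for name, chrom in counts.items():
        chrom_sites = sum(s for s, _ in chrom.values())
        chrom_errors = sum(e for _, e in chrom.values())
        result[name] = {
            # Chart 2: share of the group's genome-wide errors
            "chart2": {g: percent(chrom[g][1], group_errors[g]) for g in groups},
            # Chart 3: errors relative to the group's sites in this chromosome
            "chart3": {g: percent(chrom[g][1], chrom[g][0]) for g in groups},
            # Chart 4: all errors relative to all sites in this chromosome
            "chart4": percent(chrom_errors, chrom_sites),
            # Charts 5: errors (NE) and correct sites (EQ) relative to the genome
            "chart5_NE": percent(chrom_errors, genome_sites),
            "chart5_EQ": percent(chrom_sites - chrom_errors, genome_sites),
        }
    return result

for name, values in normalizations(COUNTS).items():
    print(name, values)
```

Under this reading, the Chart 2 values of one site group sum to 100% over all chromosomes, and the NE and EQ values of Charts 5 sum to 100% over the whole genome.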
References
[1] Duvick, J., Fu, A., Muppirala, U., Sabharwal, M., Wilkerson, M.D., Lawrence, C.J., Lushbough, C. & Brendel, V. (2008) PlantGDB: a resource for comparative plant genomics. Nucl. Acids Res. 36, D959-D965.
[2] Hall, S.L., Padgett, R.A. (1994) Conserved sequences in a class of rare eukaryotic nuclear introns with non-consensus splice sites. J. Mol. Biol. 239 (3): 357-65.
[3] Wesche, P.L., Gaffney, D.J., Keightley, P.D. (2004) DNA sequence error rates in Genbank records estimated using the mouse genome as reference. DNA Sequence 15(5/6): 362-64.
Thank You
Presented by: Valeriya Simeonova

Sofia 19.06.2011 bio math
