SNP Detection for Massively Parallel Whole-genome Sequencing
Genome Research 19:1124-1132, 2009
Speaker: Eric C.Y., LEE
Aim
• The authors want to develop a SNP-calling method for the Illumina platform.
• The method considers the data quality, alignment, and experimental errors common to this platform.
Applications of NGS
• Use whole-genome sequences to identify genetic variation between individuals.
  • Disease
  • Drug
  • Environment
Workflow
• Sequencing reads
• Map reads onto reference genome
• Recalibrate sequencing quality score
• Calculate likelihood of each genotype
• Prior probability of each genotype
• Inferred genotype via Bayes theorem
Traditional Method
• Phred score is a universal standard.
• Compare the sample sequence with the reference genome and filter out low-score mismatches.
• A method to detect heterozygous polymorphisms.
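
As a reminder of what the Phred score encodes, the sketch below converts between a quality score Q and an error probability P using the standard definition Q = -10 log10(P), and filters low-score mismatches. The function names, the tuple layout, and the `min_q=20` cutoff are illustrative assumptions, not values from the paper.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)

def filter_mismatches(mismatches, min_q=20):
    """Keep mismatches whose base quality is at least min_q.

    mismatches: iterable of (position, observed_base, quality) tuples
    (a hypothetical layout chosen for this sketch).
    """
    return [m for m in mismatches if m[2] >= min_q]

calls = [(101, "A", 35), (245, "T", 8), (310, "G", 20)]
kept = filter_mismatches(calls)  # drops the Q8 mismatch at position 245
```

A fixed cutoff like this treats every surviving mismatch as equally trustworthy, which is the limitation the Bayesian approach in the following slides addresses.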
Prior Probability
• According to existing research:
  • The estimated SNP rate between two human haploid chromosomes is about 0.001 (Sachidanandam et al. 2001).
  • The human reference genome sequence has an error rate of 0.00001 (Collins et al. 2004).
• The homozygous SNP rate is set at 0.0005 and the heterozygous SNP rate at 0.001.
Prior Probability
• According to a previous study of dbSNP, transitions are four times more frequent than transversions among substitution mutations (Zhao and Boerwinkle 2002).
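
One way to turn these rates into per-genotype priors is sketched below. It assumes the 4:1 ratio is applied per alternative allele (weights 4:1:1 for the one transition and the two transversions), which the slides do not state explicitly, and it omits heterozygotes that carry two non-reference alleles. All names here are illustrative.

```python
BASES = "ACGT"
# The transition partner of each base (purine<->purine, pyrimidine<->pyrimidine).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}

HOM_SNP_RATE = 0.0005  # homozygous SNP rate from the slides
HET_SNP_RATE = 0.001   # heterozygous SNP rate from the slides
TI_WEIGHT = 4.0        # assumption: transition weighted 4x each transversion

def genotype_priors(ref: str) -> dict:
    """Prior probability of each genotype given the reference base."""
    weights = {
        b: (TI_WEIGHT if TRANSITION[ref] == b else 1.0)
        for b in BASES if b != ref
    }
    total = sum(weights.values())
    priors = {}
    for alt, w in weights.items():
        share = w / total
        priors[alt + alt] = HOM_SNP_RATE * share                   # hom. SNP
        priors["".join(sorted(ref + alt))] = HET_SNP_RATE * share  # het. SNP
    # Simplification: the remaining mass goes to the homozygous reference.
    priors[ref + ref] = 1.0 - sum(priors.values())
    return priors

priors_g = genotype_priors("G")
```

Under these assumptions, a reference G gives the homozygous-reference genotype GG a prior of 0.9985, and the transition genotypes (AA, AG) four times the prior of the corresponding transversion genotypes.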
Alignment
• Indels are a source of error.
• SOAP is used for alignment.
Recalibration
• The 3' end of reads has a much higher error rate than earlier cycles.
• The original quality score cannot represent the true error rate.
• Check mismatches against dbSNP.
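
The idea of recalibration can be sketched as counting how often bases with a given reported quality and sequencing cycle actually mismatch the reference, then converting that empirical rate back to a Phred score. This is a minimal illustration, not the paper's procedure: the input layout, the add-one smoothing, and the choice to skip known dbSNP sites (so that real variants are not counted as errors) are assumptions.

```python
import math
from collections import defaultdict

def recalibrate(observations, known_snp_sites=frozenset()):
    """Map (reported_q, cycle) to an empirical Phred score.

    observations: iterable of (position, cycle, reported_q, is_mismatch)
    tuples (a hypothetical layout chosen for this sketch).
    known_snp_sites: positions to skip, e.g. sites listed in dbSNP.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total]
    for pos, cycle, q, is_mismatch in observations:
        if pos in known_snp_sites:
            continue  # assumption: a known SNP is not a sequencing error
        entry = counts[(q, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    table = {}
    for key, (mismatches, total) in counts.items():
        rate = (mismatches + 1) / (total + 2)  # smoothing avoids log10(0)
        table[key] = -10 * math.log10(rate)
    return table

# 98 bases reported as Q30 at cycle 35, of which 9 mismatch the reference,
# plus one mismatch at a known SNP site that is ignored.
obs = [(i, 35, 30, i < 9) for i in range(98)] + [(1000, 35, 30, True)]
table = recalibrate(obs, known_snp_sites={1000})
```

In this toy input, bases reported as Q30 at a late cycle behave more like Q10, which is the kind of gap between reported and true error rate that motivates recalibration.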
Recalibration
• Illumina uses two lasers.
  • A and C use the same laser; G and T use the other.
  • A-C and G-T substitutions were overestimated by 58%-72%.
• Duplicate reads
  • A penalty is applied to these reads.
Likelihood Calculation
• Observed allele type
• Quality score
• Sequencing cycle
• Observation of the same allele from reads with the same mapping location
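
To show how a likelihood and a prior combine through Bayes' theorem, here is a deliberately simplified sketch. It uses only the observed allele and its quality score, treats observations as independent, and spreads the error probability evenly over the three other bases. It ignores sequencing cycle and the penalty for repeated observations from the same mapping location, so it is not the model from the paper. The function names and the example priors are assumptions.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"
# The ten unordered diploid genotypes: AA, AC, AG, AT, CC, ...
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]

def base_likelihood(obs: str, allele: str, q: float) -> float:
    """P(observed base | true allele), assuming uniform miscalls."""
    e = 10 ** (-q / 10)
    return 1 - e if obs == allele else e / 3

def genotype_likelihood(genotype: str, pileup) -> float:
    """P(pileup | genotype); each read samples either allele with prob. 0.5."""
    lik = 1.0
    for obs, q in pileup:
        lik *= 0.5 * base_likelihood(obs, genotype[0], q) \
             + 0.5 * base_likelihood(obs, genotype[1], q)
    return lik

def posterior(pileup, priors) -> dict:
    """Posterior over genotypes via Bayes' theorem."""
    joint = {g: priors.get(g, 0.0) * genotype_likelihood(g, pileup)
             for g in GENOTYPES}
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

# Example priors for a reference base G (illustrative values only).
priors = {"GG": 0.9985, "AG": 0.001, "AA": 0.0005}
pileup = [("G", 30)] * 5 + [("A", 30)] * 5  # (observed base, quality)
post = posterior(pileup, priors)
best = max(post, key=post.get)
```

With five G and five A observations at Q30, the heterozygous genotype AG dominates the posterior despite its small prior, because the homozygous genotypes would need five independent sequencing errors to explain the data.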
Evaluation
• Comparison of the consensus sequence with Illumina human 1M BeadChip genotyped alleles from the same DNA sample showed that genotyped alleles on the X chromosome and autosomes were covered at 99.97% and 99.84% consistency, respectively.