Next-generation sequencing and SNPs. The presentation discusses the factors involved in finding real SNPs, including sequencing technology, mapping algorithms, variant calling, and filtering, drawing on experience from exome resequencing. Accurate variant detection requires optimizing each step, from sequencing through filtering out false positives and false negatives.
Face it, backticks are a pain. BASH's $() construct provides a simpler, more effective approach. This talk uses examples from automating git branches and command-line processing with getopt(1) to show how $() works in shell scripts.
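As a taste of the argument, a minimal sketch (my own, not taken from the talk's slides) of why $() beats backticks; the git line is a hypothetical illustration:

```shell
# Backticks require awkward escaping when nested; $() nests cleanly.
# A hypothetical git example in the talk's spirit might look like:
#   branch=$(git rev-parse --abbrev-ref HEAD)

# Nesting with $() -- each level reads naturally:
outer=$(echo "one $(echo "two $(echo three)")")
echo "$outer"          # one two three

# The same idea with backticks needs escaped inner backticks:
outer=`echo "one \`echo two\`"`
echo "$outer"          # one two
```

The quoting inside $() is also independent of the quoting outside it, which is what makes deep nesting practical.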
Starting with the system call getrusage(2), which returns synchronous, process-level information (mainly the maximum RSS used), this talk describes the output of getrusage, the rusage formatting utility in ProcStats, and several examples of using it to examine time and memory use.
Optional first and final outputs give baseline and total status, differencing avoids extraneous output, and user messages allow arbitrary stats and tracking content.
The combination makes this nice for tracking both long-lived and shorter, more intensive processing.
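A rough shell-level analogue of what getrusage reports (a Linux-only sketch reading /proc, not the talk's ProcStats rusage utility): VmHWM in /proc/<pid>/status is the peak resident set size, the figure getrusage mainly returns.

```shell
# Peak RSS of the current shell, in kB (Linux-only; uses /proc).
rss_kb() { awk '/^VmHWM:/ { print $2 }' /proc/$$/status; }

before=$(rss_kb)
big=$(printf 'x%.0s' $(seq 1 100000))   # do some work that allocates memory
after=$(rss_kb)

# Differencing, as the talk suggests, avoids extraneous output:
echo "peak RSS grew by $((after - before)) kB"
```

Because VmHWM is a high-water mark, the difference is zero unless the work actually pushed the peak higher, which is exactly why differencing keeps the output quiet for uninteresting intervals.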
This talk describes refactoring FindBin::libs from Perl5 to Raku: breaking the module up into functional pieces, writing the tests in Raku, and testing and releasing the module with mi6.
We have all seen repetitive code, maintained by cut+paste, that creates an object, calls a method, checks a return, calls a method, checks a return... all of it difficult to maintain because of its sheer size.
Object::Exercise replaces the pasted loops with data-driven code, the operation controlled by a data structure of methods, arguments, and expected return values. This replaces cut+paste with declarative data.
This talk describes O::E and shows a few ways to apply it for testing the MadMongers' Adventure game.
Variable interpolation is a standard way to BASH your head. This talk looks at interpolation, eval, ${} handling and "set -vx" to debug basic variable handling.
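A minimal sketch (my own, not from the talk) of the constructs it covers:

```shell
name="world"
greeting="hello, ${name}"      # ${} delimits the variable name explicitly
echo "$greeting"               # hello, world

# ${} also protects against ambiguous name boundaries:
file="report"
echo "${file}_v2.txt"          # report_v2.txt ("$file_v2" would expand to nothing)

# eval forces a second round of expansion (indirect variable access):
varname="greeting"
eval "echo \$$varname"         # hello, world

# "set -vx" echoes each command before (-v) and after (-x) expansion,
# which makes quoting and interpolation bugs visible:
# set -vx
# echo "$greeting"
# set +vx
```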
The $path to knowledge: What little it takes to unit-test Perl (Workhorse Computing)
Metadata-driven laziness, Perl, and Jenkins provide a nice mix for automated testing. With Perl, the only thing required to start testing is a file's path; from there the possibilities are endless. Using Symbol's qualify_to_ref makes it easy to validate @EXPORT and @EXPORT_OK, and knowing the path makes it easy to run "perl -wc" for diagnostics.
The beautiful thing is all of it can be lazy... er, "automated". And repeatable. And simple.
(originally presented at YAPC::Europe::2007)
No one is as critical of something as those who love it dearly. Mark Fowler has been collecting complaints from professional Perl developers for years about what warts still remain in the language when strict and warnings are turned on.
Are these problems unsolvable? A veteran Perl programmer himself, Mark attempted to solve these issues, and then turned to the experts, the people who write books on Perl and the people who maintain the perl interpreter itself, for help.
This is what he learned...
vfsStream - a better approach for file-system-dependent tests (Frank Kleine)
Have you ever been annoyed by testing classes or functions that operate on the file system? Be it tests that rely on the presence of physical files, the problem of not cleaning up correctly after the test run, or checking that your algorithm creates the correct directories and files with the correct permissions: this is for you. vfsStream to the rescue!
Perl6 regular expression ("regex") syntax has a number of improvements over the Perl5 syntax. The inclusion of grammars as first-class entities in the language makes many uses of regexes clearer, simpler, and more maintainable. This talk looks at a few improvements in the regex syntax and also at how grammars can help make regex use cleaner and simpler.
Building a Perl5 smoketest environment in Docker using CPAN::Reporter::Smoker. Includes an overview of "smoke testing" and the shell commands to construct a hybrid environment with an underlying O/S image and data volumes for /opt and /var/lib/CPAN. This allows maintaining the Perly smoke environment without having to rebuild it.
It's quite hard to write cross-platform CPAN modules, especially when you use XS to interface with C libraries. Luckily, CPAN Testers tests your modules on many platforms for you. Come see how CPAN Testers helped me to create a fully portable module.
Presented at YAPC::Europe 2011.
Enabling Biobank-Scale Genomic Processing with Spark SQL (Databricks)
With the size of genomic data doubling every seven months, existing tools in the genomic space designed for the gigabyte scale tip over when used to process the terabytes of data being made available by current biobank-scale efforts. To enable common genomic analyses at massive scale while remaining flexible enough for ad hoc analysis, Databricks and the Regeneron Genetics Center have partnered to launch an open-source project.
The project includes optimized DataFrame readers for loading genomics data formats, as well as Spark SQL functions to perform statistical tests and quality control analyses on genomic data. We discuss a variety of real-world use cases for processing genomic variant data, which represents how an individual’s genomic sequence differs from the average human genome. Two use cases we will discuss are: joint genotyping, in which multiple individuals’ genomes are analyzed as a group to improve the accuracy of identifying true variants; and variant effect annotation, which annotates variants with their predicted biological impact. Enabling such workflows on Spark follows a straightforward model: we ingest flat files into DataFrames, prepare the data for processing with common Spark SQL primitives, perform the processing on each partition or row with existing genomic analysis tools, and save the results to Delta or flat files.
Ruby is amazing. It has a huge standard library and a core chock-full of weird and wonderful things. In this talk, given at the Ipswich Ruby User Group, I give a whimsical nonstop tour through some of the more obscure parts of Ruby.
A description of the API concept in engineering and how it can be useful, particularly with respect to genomics data. Finally, an analogy to the API concept in synthetic biology, and how evolution allows encapsulation.
Slides for my talk at SkyCon'12 in Limerick.
Here I've squeezed four talks into one, covering a lot of ground quickly, so I've included links to more detailed presentations and other resources.
The fundamentals and advanced applications of Node will be covered. We will explore the design choices that make Node.js unique, how these change the way applications are built, and how systems of applications work most effectively in this model. You will learn how to create modular code that's robust, expressive, and clear, and understand when to use callbacks, event emitters, and streams.
Adding 1.21 Gigawatts to Applications with RabbitMQ, PHPNW Dec 2014 Meetup (James Titcumb)
As your application grows, you soon realise you need to break it up into smaller chunks that talk to each other. You could just use web services to interact, or you could take a more robust approach and use the message broker RabbitMQ. In this talk, we will look at techniques you can use to vastly enhance inter-application communication, learn the core concepts of RabbitMQ, cover how you can scale different parts of your application separately, and modernise your development using a message-oriented architecture.
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi... (QIAGEN)
This slidedeck discusses the most biologically efficient, cost-effective method for successful NGS. The GeneRead DNA QuantiMIZE Kits enable determination of the optimum conditions for targeted enrichment of DNA isolated from biological samples, while the GeneRead DNAseq Panels V2 allow you to quickly and reliably deep sequence your genes of interest. Applications in translational and clinical research are highlighted.
J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module... (Jan Aerts)
Presentation at BOSC2012 by J Wang - bioKepler: a comprehensive bioinformatics scientific workflow module for distributed analysis of large-scale biological data
P Rocca-Serra - The open source ISA metadata tracking framework: from data cu... (Jan Aerts)
Presentation at BOSC2012 by P Rocca-Serra - The open source ISA metadata tracking framework: from data curation and management at the source, to the linked data universe
2. Aim
To identify the SNP that causes a disease phenotype:
– Find them all, so you don't miss it (false negatives)
– Not find too many, so it's useful (false positives)
3. General principle
Map reads to the reference sequence
Convert from read-based to base-based (i.e. pileup)
Look at differences
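The map, pileup, diff principle can be sketched with a toy, self-contained example (the pileup-style table is invented inline here; a real pipeline would generate it with a tool such as samtools mpileup):

```shell
# Toy pileup-style table: position, reference base, observed read bases.
printf '%s\n' \
  '100 A AAAAA' \
  '101 C CCTCC' \
  '102 G GGGGG' \
  '103 T TTCTT' > pileup.txt

# Report positions where any read base differs from the reference:
awk '{
    diff = 0
    for (i = 1; i <= length($3); i++)
        if (substr($3, i, 1) != $2) diff++
    if (diff > 0)
        printf "%s ref=%s nonref=%d/%d\n", $1, $2, diff, length($3)
}' pileup.txt
# 101 ref=C nonref=1/5
# 103 ref=T nonref=1/5
```

Whether those two candidate sites are real SNPs or sequencing errors is exactly what the rest of the deck (calling and filtering) is about.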
4. This presentation
Factors in finding real SNPs:
– Sequencing technology
– Mapping algorithms and initial calling
– Post-mapping tweaking
– Calling
– Filtering
Based on experiences in exome resequencing; "experiment 5" on the last slide (Thomas)
5. 1. Sequencing
• Provides the raw data
• Different technologies:
– Different accuracy (critical!)
– Different types of errors
8. Homopolymer runs
• Especially 454: 39% of errors are homopolymers
• A5 motifs: 3.3% error rate
• A8 motifs: 50% error rate!
Reason: signal intensity is used as a measure of homopolymer length
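The homopolymer-run length that drives these error rates (and that the HRun annotation later in the deck reports) is easy to compute; a small sketch with an invented sequence:

```shell
# Longest homopolymer run in a sequence; 454-style error rates grow
# sharply with this length (A5 vs A8 above).
seq="GATTACAAAAAGTC"
awk -v s="$seq" 'BEGIN {
    best = run = 1
    for (i = 2; i <= length(s); i++) {
        if (substr(s, i, 1) == substr(s, i - 1, 1)) run++; else run = 1
        if (run > best) best = run
    }
    print best
}'
# prints 5 (the AAAAA run)
```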
29. VCF file
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total number of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .
30. VCF file
(The same VCF file as on slide 29, annotated to mark its three parts: the "##" header lines, the "#CHROM" column-header line, and the data records.)
32. Pileup => VCF
Custom scripts, then annotate:
java -Xmx10g \
  -jar GenomeAnalysisTK.jar \
  -T VariantAnnotator \
  --assume_single_sample_reads sample \
  -R human_b36_plus.fasta \
  -D dbsnp_129_b36_plus.rod \
  -I input.bam \
  -B variant,VCF,unannotated.vcf \
  -o annotated.vcf \
  -A AlleleBalance \
  -A MappingQualityZero \
  -A LowMQ \
  -A RMSMappingQuality \
  -A HaplotypeScore \
  -A QualByDepth \
  -A DepthOfCoverage \
  -A HomopolymerRun
33. 5. Filtering
• Aim: reduce the number of false positives
• Options:
– Depth of coverage
– Mapping quality
– SNP clusters
– Allelic balance
– Number of reads with MQ0
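A minimal sketch of depth and quality filtering, using the DP and QUAL thresholds from the VCF header earlier in the deck (DP < 3 || DP > 1200; QUAL < 25.0). The two data lines are abbreviated from that example file; a real pipeline would use VariantFiltration or bcftools rather than awk:

```shell
cat > calls.vcf <<'EOF'
#CHROM POS ID REF ALT QUAL FILTER INFO
1 856182 rs9988021 G A 36.00 . DP=3;MQ=60.00
1 866362 rs4372192 A G 12.00 . DP=2;MQ=60.00
EOF

awk '/^#/ { next }
{
    qual = $6 + 0
    dp = 0
    if (match($8, /DP=[0-9]+/))
        dp = substr($8, RSTART + 3, RLENGTH - 3) + 0
    if (dp < 3 || dp > 1200) filt = "DP"
    else if (qual < 25.0)    filt = "QUAL"
    else                     filt = "PASS"
    print $1, $2, $3, filt
}' calls.vcf
# 1 856182 rs9988021 PASS
# 1 866362 rs4372192 DP
```

Note that filtering marks records rather than deleting them, so downstream steps can still see why a call was rejected.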
41. Indels
Still trickier than SNPs
– samtools / dindel / GATK
– Sample of 10 individuals, on average per individual:
• 2 novel functional high-quality SNPs
• 18 novel functional high-quality indels
"I trust manual interpretation of the reads more than the basic quality parameters we use"
44. Conclusions
Different tools exist and new ones are being created
Best to combine (intersect) the results from different pipelines
The Genome Analysis Toolkit (GATK) provides useful BAM-file processing tools:
– Realignment around indels
– Base quality recalibration
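"Combine (intersect) the results" can be done with standard tools once each pipeline's calls are reduced to a sorted list of chrom:pos keys (the lists below are invented for illustration):

```shell
# One sorted variant list per caller; comm -12 keeps only shared calls.
sort > caller_a.txt <<'EOF'
1:856182
1:866362
2:104532
EOF
sort > caller_b.txt <<'EOF'
1:856182
2:104532
2:998877
EOF

comm -12 caller_a.txt caller_b.txt
# -> 1:856182 and 2:104532, the calls both pipelines agree on
```

This assumes both pipelines call against the same reference build and normalize variant representations identically; otherwise identical variants can fail to match on position alone.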
45. Use in resequencing
• Identify SNPs/indels
• Consequences (loss of function?)
• Prevalence in cases/controls
• Model:
– Dominant: any het
– Recessive: hom nonref or compound het
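The dominant and recessive models above can be sketched as a genotype classifier (a hypothetical helper; GT strings as in the VCF example earlier):

```shell
# Classify a single VCF GT field against the two inheritance models.
classify() {
    case "$1" in
        0/0) echo "hom ref" ;;
        0/1|1/0) echo "het (fits dominant model)" ;;
        1/1) echo "hom nonref (fits recessive model)" ;;
        *)   echo "other" ;;
    esac
}

classify 0/1   # het (fits dominant model)
classify 1/1   # hom nonref (fits recessive model)
```

A compound het cannot be seen in one GT field: it needs two het variants aggregated at the gene level (ideally with phasing to confirm they hit different copies).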
46. References
• Chan E. In: Single Nucleotide Polymorphisms, Methods in Molecular Biology 578 (2009)
• McKenna et al. Genome Res 20:1297-1303 (2010)
• Li H & Durbin R. Bioinformatics 25:1754-1760 (2009)
• Li H et al. Bioinformatics 25:2078-2079 (2009)
• Li H et al. Genome Res 18:1851-1858 (2008)