Next-generation sequencing - variation discovery

[I0D51A] Bioinformatics: High-Throughput Analysis
Next-generation sequencing.
Part 3: Variation discovery
Prof Jan Aerts
Faculty of Engineering - ESAT/SCD
jan.aerts@esat.kuleuven.be
TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)

Types of genomic variation
SNPs vs structural variation

A - Single nucleotide polymorphisms (SNPs)

What are SNPs and why are they important?
• SNP = single nucleotide polymorphism
• It’s the differences that matter:
• Human vs chimp: 98% identical (<2 differences every 100bp)
• Between any 2 individuals: 1 difference every 1000bp
• Disease: A or G == life or death
• Mutations can result in:
• change in level of transcription or translation (loss/gain)
• change in protein structure

SNP discovery - overview
generate sequence reads
➡ map reads to reference sequence
➡ convert from read-based to position-based (“pileup”)
➡ identify differences

Monet “Meule, Effet de Neige, le Matin”
Not a trivial problem...

Read-based -> position-based
Many SNP callers:
• samtools
• GATK
• SOAPsnp
• ...
Here: (1) samtools -> pileup; (2) GATK -> VCF

pileup
1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
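Columns: chromosome, position, reference base, read depth, read bases, base qualities.
In the read bases, the character following “^” encodes the alignment mapping quality.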
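To make the step from pileup line to allele counts concrete, here is a minimal sketch that tallies the bases in the read-bases column. It assumes the classic samtools pileup symbols (“.” and “,” for a reference match, “^” plus one mapping-quality character at a read start, “$” at a read end, “+n”/“-n” followed by n bases for indels, “*” for a deleted base). The function name is hypothetical.

```python
from collections import Counter

def count_pileup_bases(ref_base, bases):
    """Count the alleles seen in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # read start: the next character is the mapping quality, not a base
            i += 2
            continue
        if c == "$":
            # read end marker
            i += 1
            continue
        if c in "+-":
            # indel: sign, length, then that many inserted/deleted bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if c in ".,":
            counts[ref_base.upper()] += 1   # match on forward/reverse strand
        elif c != "*":
            counts[c.upper()] += 1          # mismatch on either strand
        i += 1
    return counts

# Position 277 from the pileup above
print(count_pileup_bases("T", "..CCggC,C,.C.,,CC,..g."))
```

Counting forward-strand and reverse-strand observations together is a simplification; a real caller would also weigh base and mapping qualities.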

Intermezzo: quality scores
“Phred-score”: used for sequence quality as well as mapping quality
Chance of 1/1000 that read is mapped at wrong position = 10^-3 => phred-score = 30
Chance of 1/100 that read is mapped at wrong position = 10^-2 => phred-score = 20
In general: phred-score Q = -10 x log10(P), with P the probability of an error
Sanger encoding (ASCII code = Q + 33): quality score 30 = “?”
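The conversions between error probability, Phred score and quality character can be sketched as follows. This assumes the Sanger (Phred+33) offset; the function names are hypothetical.

```python
import math

def phred_from_error_prob(p):
    """Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def error_prob_from_phred(q):
    """Inverse conversion: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def sanger_char(q):
    """Sanger encoding: ASCII code = Q + 33."""
    return chr(int(q) + 33)

def sanger_quality(ch):
    """Decode one Sanger-encoded quality character back to Q."""
    return ord(ch) - 33

print(round(phred_from_error_prob(1 / 1000)))  # Phred score for an error chance of 1/1000
print(sanger_quality("<"))                     # the most common character in the pileup above
```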


Heterozygous SNPs and the binomial distribution
SNPs are bi-allelic => allele counts for a heterozygous SNP follow a binomial distribution
outcome = binary (red/white, head/tail, yes/no, A/G)
probability p of the outcome of a single draw is the same for all draws
E.g. 8 A’s + 12 G’s = SNP?
hypothesis: heterozygous => nr of draws = 20; nr of “successes” = 8; probability p of outcome in single draw = 0.5
table with cumulative binomial probabilities: http://bit.ly/cumul_binom_prob
8 A’s given coverage of 20 => cumulative probability = 0.252 > 0.05 => heterozygous hypothesis is not rejected => heterozygote
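Instead of looking the value up in a table, the cumulative binomial probability can be computed directly. This is a minimal sketch using only the Python standard library; the function name is hypothetical and the 0.05 threshold is the one used in the example above.

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# 8 A's out of 20 reads, under the heterozygous hypothesis (p = 0.5)
prob = binom_cdf(8, 20)
print(f"cumulative probability = {prob:.3f}")
print("consistent with heterozygote" if prob > 0.05 else "heterozygous hypothesis rejected")
```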

samtools
samtools pileup -vcs -r 0.001 -l CCDS.txt -f human_b36_plus.fasta input.bam > output.pileup

VCF file
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .

VCF file: structure
file header: the lines starting with “##”, e.g.
##FILTER=DP,"DP < 3 || DP > 1200"
column header: the line starting with “#CHROM”
actual data: one line per variant, e.g.
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .

VCF file: INFO and FORMAT columns
INFO
DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE
FORMAT a_a:bwa057_b:picard.bam
GT:DP:GQ 1/1:3:36.00
GT:DP:GQ 1/1:6:45.00
GT = genotype; DP = depth; GQ = genotype quality
1/1 = homozygous non-reference
0/1 = heterozygous
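How the FORMAT column pairs up with the per-sample column can be sketched as below. This is an illustration only, with hypothetical function names, and it handles just the diploid genotypes discussed here.

```python
def parse_sample(format_field, sample_field):
    """Pair the keys in FORMAT (e.g. GT:DP:GQ) with the values of one sample."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def zygosity(gt):
    """Interpret a diploid genotype string such as 0/1 or 1/1."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous non-reference"

sample = parse_sample("GT:DP:GQ", "1/1:3:36.00")
print(sample["GT"], zygosity(sample["GT"]), "depth:", sample["DP"])
```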

GATK
java -Xmx6g -jar /path_to/GenomeAnalysisTK.jar \
  -l INFO \
  -R human_b36_plus.fasta \
  -I input.bam \
  -T UnifiedGenotyper \
  --heterozygosity 0.001 \
  -pl Solexa \
  -varout output.vcf \
  -vf VCF \
  -mbq 20 \
  -mmq 10 \
  -stand_call_conf 30.0 \
  --DBSNP dbsnp_129_b36_plus.rod

SNP annotation
by piculak (Flickr)

We have: chromosome + position + alleles
We need:
• in gene?
• damaging?
These annotations will be the basis for filtering
SIFT (http://sift.bii.a-star.edu.sg), annovar, PolyPhen, ...

SIFT
input:
3,81780820,1,T/C
2,43881517,1,A/T
2,43857514,1,T/C
output:
#SNP codon substitution region type prediction gene OMIM
3,81780820,1,T/C AGA-gGA R190G EXON CDS Nonsynonymous DAMAGING GBE1 POLYGLUCOSAN BODY DISEASE
2,43881517,1,A/T ATA-tTA I230L EXON CDS Nonsynonymous TOLERATED DYNC2LI1
2,43857514,1,T/C TTT-TcT F33S EXON CDS Nonsynonymous TOLERATED DYNC2LI1
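Going from “chromosome + position + alleles” to the SIFT input lines shown above can be sketched as below. The field order (chromosome, position, orientation, alleles) is an assumption read off the example lines; check the SIFT documentation before relying on it. The function name is hypothetical.

```python
def sift_input_line(chrom, pos, ref, alt, orientation=1):
    """Format one SNP like the input lines above, e.g. 3,81780820,1,T/C.

    Assumption: the third field is the strand orientation (1 = forward).
    """
    return f"{chrom},{pos},{orientation},{ref}/{alt}"

snps = [("3", 81780820, "T", "C"), ("2", 43881517, "A", "T")]
for chrom, pos, ref, alt in snps:
    print(sift_input_line(chrom, pos, ref, alt))
```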


SNP filtering
2 aspects:
• filtering to improve quality of SNP calls
• filtering to find likely candidates

Filtering to improve quality
Reduce false positives without increasing false negatives:
• depth of coverage
• mapping quality
• SNP clusters
• allelic balance (diploid genome)
• number of reads with mapping quality 0 (MQ0)
• consequence

GATK
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R human_b36_plus.fasta \
  -o output.vcf \
  -B variant,VCF,input.vcf \
  --clusterWindowSize 10 \
  --filterExpression 'DP < 3 || DP > 1200' \
  --filterName 'DP' \
  --filterExpression 'QUAL < 20' \
  --filterName 'QUAL' \
  --filterExpression 'AB > 0.75 && DP > 40' \
  --filterName 'AB'
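What the three filter expressions in the command above do to a single record can be illustrated with a small Python sketch. This is not how GATK evaluates them internally; the record is a hypothetical dict holding QUAL, DP and (for heterozygous calls) AB, and the SNP cluster filter is left out.

```python
def failed_filters(record):
    """Return the names of the filters a variant record fails (empty list = passes)."""
    failed = []
    dp, qual, ab = record["DP"], record["QUAL"], record.get("AB")
    if dp < 3 or dp > 1200:                         # 'DP < 3 || DP > 1200'
        failed.append("DP")
    if qual < 20:                                   # 'QUAL < 20'
        failed.append("QUAL")
    if ab is not None and ab > 0.75 and dp > 40:    # 'AB > 0.75 && DP > 40'
        failed.append("AB")
    return failed

print(failed_filters({"QUAL": 36.0, "DP": 2}))   # too shallow
print(failed_filters({"QUAL": 45.0, "DP": 6}))   # passes all three
```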

VCF file (after VariantFiltration)
##FILTER=DP,"DP < 3 || DP > 1200"
. . .
1 856182 rs9988021 G A 36.00 DP DB;DP=2;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 PASSED DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .

Transition/transversion ratio (Ti/Tv)
random: Ti/Tv = 0.5
whole genome: Ti/Tv = 2.0-2.1
exome: Ti/Tv = 3-3.5
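The Ti/Tv ratio of a SNP set is easy to compute and explains the “random” value: of the 12 possible substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions. A minimal sketch, with hypothetical names and (ref, alt) tuples as input:

```python
TRANSITIONS = ({"A", "G"}, {"C", "T"})   # purine<->purine, pyrimidine<->pyrimidine

def titv_ratio(snps):
    """snps: iterable of (ref, alt) single-base pairs."""
    ti = tv = 0
    for ref, alt in snps:
        pair = {ref.upper(), alt.upper()}
        if len(pair) != 2:
            continue                     # ref == alt: not a substitution
        if pair in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")

# All 12 possible substitutions, each equally likely
all_substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
print(titv_ratio(all_substitutions))
```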

Novel SNPs
Number of novel SNPs
exome:
total = 20k - 25k
novel = 1k - 3k

Factors that influence SNP accuracy
• sequencing technology
• mapping algorithms and parameters
• post-mapping manipulation: duplicate removal, base quality recalibration, local realignment around indels, ...
• SNP calling algorithms and parameters

Specificity vs sensitivity
[Figure; axis labels: true positives, false positives]

Filtering to find likely candidates
Which are the most interesting?
• only high quality: DP, QUAL, AB, but keep an eye on Ti/Tv
• novel
• loss-of-function (stop gained, splice site, ...) or predicted to be damaging (non-synonymous)
• found in multiple individuals
• conserved
• homozygous non-reference or compound heterozygous

Disease model
• dominant: a single heterozygous SNP is damaging
• recessive: either homozygous non-reference or compound heterozygous is necessary to lead to the disease phenotype (e.g. phenylketonuria: cannot convert phenylalanine to tyrosine; can lead to intellectual disability, microcephaly, ...)
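The two disease models translate into different per-gene filters. The sketch below assumes a list of (gene, genotype) pairs for the damaging variants of one individual; the names are hypothetical. Two heterozygous variants in the same gene are only a candidate compound heterozygote, since phase is unknown without further data.

```python
from collections import defaultdict

def candidate_genes(variants, model):
    """variants: (gene, genotype) pairs, genotype "0/1" or "1/1"."""
    n_het = defaultdict(int)
    n_hom = defaultdict(int)
    for gene, genotype in variants:
        if genotype == "1/1":
            n_hom[gene] += 1
        elif genotype in ("0/1", "1/0"):
            n_het[gene] += 1
    genes = set(n_het) | set(n_hom)
    if model == "dominant":
        # a single damaging allele is enough
        return sorted(genes)
    if model == "recessive":
        # homozygous non-reference, or at least two heterozygous hits in one gene
        return sorted(g for g in genes if n_hom[g] >= 1 or n_het[g] >= 2)
    raise ValueError("model must be 'dominant' or 'recessive'")

variants = [("GBE1", "1/1"), ("DYNC2LI1", "0/1"), ("DYNC2LI1", "0/1"), ("AMY1", "0/1")]
print(candidate_genes(variants, "recessive"))
print(candidate_genes(variants, "dominant"))
```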

Why bother?
Iafrate et al, Nat Genet 2004 & Sebat et al, Science 2004
Redon et al, Nature 2006: 12% of genome is covered by copy number variable regions (270 individuals) => more nucleotide content per genome than SNPs
• colour vision in primates
• CCL3L1 copy number -> susceptibility to HIV
• AMY1 copy number -> diet
=> “the dynamic genome”

Case 1: Evolution - chromosome fusion

human chromosome 2
chimp chromosome 12
chimp chromosome 13
by Beth Kramer

Case 2: Cancer - rearranged genome
[Figure: normal karyotype vs colorectal cancer karyotype]
Molecular Biology of the Cell, 4th Edition

Case 3: Embryogenesis - “abnormal” cells
segmental chromosomal imbalances
mosaicism for whole chromosomes
uniparental isodisomy
Robberecht et al, 2010

Case 4: Down Syndrome = trisomy 21

Types of structural variation
Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009

CNV = Copy Number Variation

Copy number variation (CNV)
Not equally distributed over genome: more pericentromeric and subtelomeric (especially in primates)
Pericentromeric & subtelomeric regions: bias towards interchromosomal rearrangements; interstitial regions: bias towards intrachromosomal
Generation of duplications:
pericentromeric: 2-stage model (Sharp & Eichler, 2006)
1. series of seeding events: one or more progenitor loci transpose together to pericentromeric receptor => generates mosaic block of duplicated segments derived from different loci
2. inter- & intrachromosomal duplication => large blocks are duplicated near other centromeres
subtelomeric: due to normal recombination: cross-overs lead to translocation of distal sequences between chromosomes

Copy number variation and segmental duplications
Close relationship between CNVs and segmental duplications (aka low-copy repeats aka LCRs; genomic regions with >1 copy that are at least 1kb long and have at least 90% sequence similarity):
• Copy number variation that is fixed in population = segmental duplication (in other words: segmental duplications started out themselves as copy number variations)
• Segmental duplications can stimulate formation of new CNVs due to NAHR (see later)
➡ In human + chimp: 70-80% of inversions and 40% of insertions/deletions overlap with segmental duplications
➡ 80% of human segmental duplications arose after the divergence of Great Apes from the rest of the primates

Effects of structural variation
Feuk et al, 2006

Mechanisms of formation for structural variation
Gu et al, 2008

Mechanisms: NAHR
NAHR = non-allelic homologous recombination
often between segmental duplications
• can recur
• clustered breakpoints
• larger
Hastings et al, 2009

Mechanisms: NHEJ
NHEJ = non-homologous end-joining
pathway to repair double-strand breaks, but may lead to translocations and telomere fusion
not associated with segmental duplications
• more scattered
• unique origins
• smaller
Gu et al, 2008

Mechanisms: FoSTeS
FoSTeS = DNA replication fork-stalling and template switching
can occur multiple times in series => can generate very complex rearrangements
Hastings et al, 2009

Discovery of structural variation
Feuk et al, 2006

Approaches for discovery
• karyotyping, fluorescence in situ hybridization (FISH)
• array comparative genomic hybridization (aCGH)
• next-generation sequencing: combination of:
• read pair information
• read depth information
• split read information
• for fine-mapping breakpoints: local assembly
=> identify signatures

Structural variation discovery using FISH
FISH = fluorescence in situ hybridization
[Figure panels: duplication, inversion, duplication]
Feuk et al, 2006

Structural variation discovery using aCGH
aCGH = array comparative genomic hybridization
Xie & Tammi, 2009

http://www.breenlab.org/array.html

Structural variation discovery using next-generation sequencing
General approaches:
1. Read depth
2. Read pairs
3. Split reads

Structural variation discovery: read depth
Xie & Tammi, 2009

Workflow
1. Mapping
2. Read filtering
3. GC correction
4. Spike identification
5. Validation

General principle
• Similar to aCGH: using a reference read depth (RD) file (e.g. from the 1000 Genomes Project)
• In theory: higher resolution, but noisier than aCGH
• Algorithms not mature yet
• More complex steps
➡ Data binned
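The binning step, and the comparison against a reference read depth, can be sketched as follows. This is a simplified illustration with hypothetical names: reads are represented by their start positions on one chromosome, windows have a fixed size, and counts are scaled by the total number of reads in each sample.

```python
import math
from collections import Counter

def bin_counts(read_starts, window):
    """Count reads per fixed-size window (window index = position // window)."""
    return Counter(pos // window for pos in read_starts)

def log2_ratios(test_starts, ref_starts, window):
    """log2(test/reference) read depth per window, normalised for library size."""
    test = bin_counts(test_starts, window)
    ref = bin_counts(ref_starts, window)
    scale = len(ref_starts) / len(test_starts)
    # windows with no test reads are skipped to avoid log2(0)
    return {b: math.log2(test[b] * scale / ref[b]) for b in sorted(ref) if test[b] > 0}

test_reads = [10, 20, 30, 40, 110, 120]   # toy data: extra reads in the first window
ref_reads = [15, 25, 115, 125]
print(log2_ratios(test_reads, ref_reads, window=100))
```

A positive value suggests a gain relative to the reference sample and a negative value a loss; real data need far larger windows and a statistical test before calling anything.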

Combining CNV data for >1 individuals/samples
CNV = copy number variation
CNVR = copy number variation region = any region covered by at least 1 CNV
CNVE = copy number variation event = subgroups of CNVR with >= 50% reciprocal overlap
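Reciprocal overlap means the overlap must cover the required fraction of both CNVs, not just one of them. A minimal sketch, assuming half-open (start, end) intervals on the same chromosome; the function names are hypothetical:

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval that is covered by the overlap."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def same_cnve(a, b, threshold=0.5):
    """Two CNVs belong to the same CNVE if their reciprocal overlap is >= 50%."""
    return reciprocal_overlap(a, b) >= threshold

print(same_cnve((0, 100), (50, 150)))    # each CNV is half covered by the other
print(same_cnve((0, 100), (90, 1000)))   # small overlap with a much larger CNV
```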

Data normalization
• Mainly: GC content
• Other: repeat-rich regions, mapping quality, ...
• Fit linear model between GC content and read depth (RD) => noise decreases
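The GC correction can be sketched as an ordinary least-squares fit of read depth on GC content, after which the fitted GC effect is subtracted from each window. This is an illustration under simple assumptions (one GC fraction and one read depth per window, a purely linear effect), not the procedure of any specific tool; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gc_correct(gc, depth):
    """Remove the fitted GC trend, keeping the overall mean depth."""
    slope, intercept = fit_line(gc, depth)
    mean_depth = sum(depth) / len(depth)
    return [d - (slope * g + intercept) + mean_depth for g, d in zip(gc, depth)]

gc_fraction = [0.30, 0.40, 0.50, 0.60]
read_depth = [30.0, 40.0, 50.0, 60.0]    # toy data: depth driven entirely by GC
print(gc_correct(gc_fraction, read_depth))
```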

Segmentation
• Identify spikes
• Many segmentation algorithms, e.g. GADA
• Issue: setting parameters: when to cut off peaks?
• Combine outputs from different runs with different parameters
• Compare to known CNVs

Xie & Tammi, 2009
...but is this?

Drawbacks
• Can only find unbalanced structural variation (i.e. CNVs)
• How to assess specificity and sensitivity? => compare with known CNVs
• Database of Genomic Variants DGV (http://projects.tcag.ca/variation/)
• Decipher (http://decipher.sanger.ac.uk/)
• Breakpoints: unknown
• Different parameters for rare vs common CNVs => which?

Structural variation discovery: read pairs
Korbel et al, 2007

Discordant readpairs
• Orientation
• Distance
• Plot insert size distribution for chromosome
• Very long tail!! => difficult to set cutoff: 4 MAD (median absolute deviations) or 0.01%?
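A MAD-based cutoff on the insert size distribution can be sketched as below. The median and the MAD are robust to the long tail, which is why they are preferred over mean and standard deviation here. The function names are hypothetical and the factor 4 is the one mentioned above.

```python
from statistics import median

def mad_cutoffs(insert_sizes, n_mads=4):
    """Lower and upper insert size cutoffs: median +/- n_mads * MAD."""
    med = median(insert_sizes)
    mad = median(abs(size - med) for size in insert_sizes)
    return med - n_mads * mad, med + n_mads * mad

def discordant_by_distance(insert_sizes, n_mads=4):
    """Insert sizes that fall outside the cutoffs."""
    low, high = mad_cutoffs(insert_sizes, n_mads)
    return [size for size in insert_sizes if size < low or size > high]

sizes = [190, 195, 200, 205, 210, 5000]   # toy data: one pair spanning a deletion
print(mad_cutoffs(sizes))
print(discordant_by_distance(sizes))
```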

Read pair signatures
Medvedev et al, 2009

Read pair workflow
1. Map reads
2. Identify discordant pairs
3. Cluster on location
4. Filter on number of readpairs per cluster
5. Filter on read depth
6. Filter on mapping quality for read pairs
7. Identify signatures
8. (Optionally) create alternative reference
9. Validate

Clustering
• “standard clustering strategy”
• only consider mate pairs that do not have concordant mappings
• ignore read pairs that have more than one good mapping
• clustering: use insert size distribution (e.g. 2x4 MAD)
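A very simplified version of clustering discordant read pairs on location is sketched below. It assumes the pairs were already selected as discordant and uniquely mapped, that each pair is given as (left position, right position) on one chromosome, and that `max_gap` has been derived from the insert size distribution. The names, the greedy single-pass strategy and the `min_pairs` filter are illustrative assumptions, not the method of a specific tool.

```python
def cluster_pairs(pairs, max_gap, min_pairs=2):
    """Greedy clustering: a pair joins the current cluster if both of its ends
    lie within max_gap of the ends of the previous pair in that cluster."""
    clusters = []
    for pair in sorted(pairs):
        last = clusters[-1] if clusters else None
        if (last
                and pair[0] - last[-1][0] <= max_gap
                and abs(pair[1] - last[-1][1]) <= max_gap):
            last.append(pair)
        else:
            clusters.append([pair])
    # keep only clusters supported by enough read pairs
    return [c for c in clusters if len(c) >= min_pairs]

pairs = [(1000, 9000), (1050, 9040), (1100, 9100), (50000, 52000)]
for cluster in cluster_pairs(pairs, max_gap=500):
    print(len(cluster), "pairs:", cluster)
```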

Clustering: issues
• Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications)
• What cutoff for what is considered abnormal distance? (4 MAD? 0.01%? 2 stdev?)
• Low library quality or mix of libraries => multiple peaks in size distribution

Filtering
• On number of read pairs (RPs) per cluster
  • normally: n = 2
  • for high coverage (e.g. 1000 Genomes pilot 2: 80X): n = 5
• On drop in read depth and split reads
• On (mappingQ x nrRP)
  • if published data available: look at specificity and sensitivity for different cutoffs of mappingQ x nrRP
  • if not: very difficult

Filtering: issues
• Large insert size: low resolution for detecting breakpoints
• Small insert size: low resolution for detecting complex regions

Structural variation discovery: split reads

Mapping
• short subsequences => many possible mappings
• solution: “anchored split mapping” (e.g. Pindel)

Local reassembly
• Aim: to determine breakpoints
• Which reads?
• for deletions: local reads
• for insertions: hanging reads for read pairs with only one read mapped
• (rather not: unmapped reads)
• For large region: split up

sequence reads -> contigs (using sequence overlap)
contigs -> scaffolds (using read-pair information)
[Figure labels: 1 scaffold, contigs]
Nielsen et al, 2009

General conclusions NGS & structural variation (1)
            | +                                | -
read depth  | conceptually simple              | only unbalanced (CNVs); low resolution
read pairs  | wide range of types of variation | complicated
split reads | basepair resolution              | very small reads

General conclusions NGS & structural variation (2)
• Available algorithms: more to demonstrate technique than comprehensive solution
• Difficult => different software = different results => “consensus set”
• based on read pairs and split reads: many sets agree
• based on read depth: totally different
• sometimes drop in read depth, but no aberrant read pairs spanning the region => why???
• Mapper = critical; maq/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results

Software for structural variation discovery

Websites
http://www.broadinstitute.org/gatk
http://samtools.sourceforge.net
http://picard.sourceforge.net
http://www.annotate-it.org
http://bit.ly/siftsnp

References and software
• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
• Lee S et al. Bioinformatics 24:i59-i67 (2008)
• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
• Campbell P et al. Nat Genet 40:722-729 (2008)
• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
• Chen K et al. Genome Res 19:1527-1741 (2009)
• Yoon S et al. Genome Res 19:1586-1592 (2009)
• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)
• Hastings P et al. Nat Rev Genet 10:551-564 (2009)

Finding SNPs using Galaxy
Based on the SAM-file you created in Galaxy in the last lecture, create a list of
SNPs. You’ll first have to convert the SAM file to BAM, then create a pileup and
finally filter the pileup (using “Filter pileup on coverage and SNPs”). Let this filter
only return variants where the coverage is larger than 3 and the base quality is
larger than 20.
How many SNPs do you find?
Calculate a histogram of the coverage over all SNPs (= column 4 in the filtered
file you just created)

Finding SNPs using samtools
Using the SAM file you created in the last lecture on the linux command line:
Generate a BAM file and sort it. Next, generate a pileup for that BAM file using
~jaerts/i0d51a/chr9.fa as the reference sequence. When doing this: only print
the variant sites and also compute the reference sequence (run “samtools
pileup” without arguments to get more info).
How many SNPs are identified? Is the SNP at position 139,391,636
heterozygous or homozygous-non-reference? And the one at 139,399,365? Do
you trust the SNP at 139,401,304?

Annotating and filtering SNPs
Download ~jaerts/i0d51a/sift.input to your own machine and then upload it to the SIFT website at http://bit.ly/siftsnp. Positions in this file are on Homo sapiens build NCBI36. Make sure to let SIFT send the results by email.
How many SNPs are in/near genes?
How many are in exons?
What percentage of the SNPs is predicted damaging?

Structural variation
We’ll be looking at copy number variation using the cnv-seq package. This software is available from http://tiger.dbs.nus.edu.sg/cnv-seq/
We’ll be running the example from the cnv-seq tutorial at http://tiger.dbs.nus.edu.sg/cnv-seq/doc/manual.pdf. (Read that!)
• Log into the server mentioned on Toledo.
• Calculate CNVs in the file ~jaerts/i0d51a/test_1.hits compared to ~jaerts/i0d51a/ref_1.hits:
/mnt/apps/cnv-seq/current/cnv-seq.pl --test ~jaerts/i0d51a/test_1.hits --ref ~jaerts/i0d51a/ref_1.hits --genome chrom1 --log2 0.6 -p 0.001 --bigger-window 1.5 --annotate --minimum-windows 4
• Finally investigate in R. Start R by typing “R”. Then:
library(cnv)
data <- read.delim('test_1.hits-vs-ref_1.hits.log2-0.6.pvalue-0.001.miw-4.cnv')
cnv.print(data)
cnv.summary(data)
plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6)
ggsave('sample_1.pdf')
• Describe the main features in the plot.
