[I0D51A] Bioinformatics: High-Throughput Analysis
Next-generation sequencing.
Part 3: Variation discovery
Prof Jan Aerts
Faculty of Engineering - ESAT/SCD
jan.aerts@esat.kuleuven.be
TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)
1
Context
2
Types of genomic variation
SNPs vs structural variation
3
A - Single nucleotide polymorphisms (SNPs)
4
What are SNPs and why are they important?
• SNP = single nucleotide polymorphism
• It’s the differences that matter:
• Human vs chimp: 98% identical (<2 differences every 100bp)
• Between any 2 individuals: 1 difference every 1000bp
• Disease: A or G == life or death
• Mutations can result in:
• change in level of transcription or translation (loss/gain)
• change in protein structure
5
6
SNP discovery - overview
generate sequence reads
➡ map reads to reference sequence
➡ convert from read-based to position-based (“pileup”)
➡ identify differences
7
Monet “Meule, Effet de Neige, le Matin”
Not a trivial problem...
12
Many SNP callers:
• samtools
• GATK
• SOAPsnp
• ...
Read-based -> position-based
Here: (1) samtools -> pileup; (2) GATK -> VCF
13
pileup
14
15
pileup
16
1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
alignment mapping quality
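The read-base column of a pileup line can be tallied per allele. The following is a minimal sketch in Python, assuming the classic samtools pileup encoding (`.` and `,` = match to the reference on the forward/reverse strand, `^` followed by one mapping-quality character = start of a read, `$` = end of a read, `+n`/`-n` = indel of length n, `*` = deleted base). The function name `count_bases` is made up for this illustration and is not part of samtools.

```python
from collections import Counter

def count_bases(ref, bases):
    """Tally alleles in the read-base column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":            # read start: the next char is a mapping quality
            i += 2
            continue
        if c == "$":            # read end marker carries no base
            i += 1
            continue
        if c in "+-":           # indel: sign, length digits, then that many bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if c in ".,":           # match to the reference base
            counts[ref.upper()] += 1
        elif c == "*":          # placeholder for a deleted base
            counts["*"] += 1
        else:                   # mismatch; lower case = reverse strand
            counts[c.upper()] += 1
        i += 1
    return counts

# Position 277 from the pileup above (reference base T, depth 22)
print(count_bases("T", "..CCggC,C,.C.,,CC,..g."))
```

For position 277 this tallies 12 reads supporting the reference T, 7 supporting C and 3 supporting G. Note that the `+` in `^+.` at position 272 is a mapping quality, not an insertion, which is why `^` is handled first.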
Intermezzo: quality scores
“Phred-score”: used for sequence quality as well as mapping quality
Phred-score Q = -10 x log10(P), with P = chance of an error
Chance of 1/1000 that read is mapped at wrong position = 10^-3 => phred-score = 30
Chance of 1/100 that read is mapped at wrong position = 10^-2 => phred-score = 20
Sanger encoding (ASCII character = quality score + 33): quality score 30 = “?”
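The conversion between error probability, Phred score and quality character can be sketched as follows. This is an illustration only, assuming the Sanger (Phred+33) ASCII offset; the function names are hypothetical.

```python
import math

SANGER_OFFSET = 33  # assumption: Sanger / Phred+33 encoding

def phred_from_prob(p_error):
    """Phred score Q = -10 * log10(P_error)."""
    return -10 * math.log10(p_error)

def prob_from_phred(q):
    """Inverse conversion: P_error = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def decode_sanger(qual_string):
    """Turn a string of quality characters into integer Phred scores."""
    return [ord(ch) - SANGER_OFFSET for ch in qual_string]

print(round(phred_from_prob(1e-3)))   # error chance 1/1000
print(round(phred_from_prob(1e-2)))   # error chance 1/100
print(decode_sanger("<<<+;"))         # first qualities at position 272 above
```

Under this encoding the `<` that dominates the pileup quality column corresponds to a score of 27, and `+` to a score of 10.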
17
Heterozygous SNPs and the binomial distribution
SNPs are bi-allelic => allele combinations for heterozygous SNP follow
binomial distribution
outcome = binary (red/white, head/tail, yes/no, A/G)
probability p of the outcome of a single draw is the same for all draws
E.g. 8 A’s + 12 G’s = SNP?
hypothesis: heterozygous => nr of draws = 20; nr of “successes” = 8;
probability p of outcome in single draw = 0.5
table with cumulative binomial probabilities: http://bit.ly/cumul_binom_prob
8 A’s given coverage of 20 => cumulative probability = 0.252 > 0.05
=> heterozygote
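Instead of looking up the table, the cumulative binomial probability for the example (8 A's out of 20 reads, p = 0.5) can be computed directly. A minimal sketch, assuming Python 3.8+ for `math.comb`; `binom_cdf` is a helper written for this illustration.

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p_value = binom_cdf(8, 20)
print(f"P(X <= 8 | n=20, p=0.5) = {p_value:.3f}")
# The 0.05 threshold is the one used on the slide
print("consistent with heterozygote" if p_value > 0.05 else "unlikely heterozygote")
```

This reproduces the 0.252 from the table, so the heterozygous hypothesis is not rejected.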
19
20
samtools pileup \
  -vcs \
  -r 0.001 \
  -l CCDS.txt \
  -f human_b36_plus.fasta \
  input.bam \
  > output.pileup
samtools
21
VCF file
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam	
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .
22
VCF file
23
The VCF file shown above has three parts:
• file header: the lines starting with ##
• column header: the line starting with #CHROM
• actual data: one record per variant
VCF file
24
INFO
DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE
FORMAT a_a:bwa057_b:picard.bam
GT:DP:GQ 1/1:3:36.00
GT:DP:GQ 1/1:6:45.00
GT = genotype
DP = depth
GQ = genotype quality
1/1 = homozygous non-reference
0/1 = heterozygous
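The INFO column and the FORMAT/sample columns shown above can be unpacked with a few lines of Python. This is a hedged sketch for illustration, not a full VCF parser; the function names are hypothetical and all values are kept as strings.

```python
def parse_info(info):
    """Turn 'DB;DP=3;MQ=60.00' into {'DB': True, 'DP': '3', 'MQ': '60.00'}."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True   # flags such as DB have no '='
    return fields

def parse_sample(fmt, sample):
    """Pair the FORMAT keys with the values of one sample column."""
    return dict(zip(fmt.split(":"), sample.split(":")))

def zygosity(gt):
    """Classify a diploid genotype string such as '0/1' or '1/1'."""
    alleles = gt.replace("|", "/").split("/")
    if alleles[0] != alleles[1]:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous non-reference"

info = parse_info("DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE")
call = parse_sample("GT:DP:GQ", "1/1:3:36.00")
print(info["DP"], call["GQ"], zygosity(call["GT"]))
```

For real data a dedicated VCF library is the safer choice; the sketch only shows how the key/value layout fits together.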
java \
  -Xmx6g \
  -jar /path_to/GenomeAnalysisTK.jar \
  -l INFO \
  -R human_b36_plus.fasta \
  -I input.bam \
  -T UnifiedGenotyper \
  --heterozygosity 0.001 \
  -pl Solexa \
  -varout output.vcf \
  -vf VCF \
  -mbq 20 \
  -mmq 10 \
  -stand_call_conf 30.0 \
  --DBSNP dbsnp_129_b36_plus.rod
GATK
25
SNP annotation
26
by piculak (Flickr)
We have: chromosome + position + alleles
We need:
• in gene?
• damaging?
will be basis for filtering
SIFT (http://sift.bii.a-star.edu/sg), annovar, PolyPhen, ...
27
28
SIFT input:
3,81780820,1,T/C
2,43881517,1,A/T
2,43857514,1,T/C
SIFT output:
#SNP              codon    substitution  region    type           prediction  gene      OMIM
3,81780820,1,T/C  AGA-gGA  R190G         EXON CDS  Nonsynonymous  DAMAGING    GBE1      POLYGLUCOSAN BODY DISEASE
2,43881517,1,A/T  ATA-tTA  I230L         EXON CDS  Nonsynonymous  TOLERATED   DYNC2LI1
2,43857514,1,T/C  TTT-TcT  F33S          EXON CDS  Nonsynonymous  TOLERATED   DYNC2LI1
29
SNP filtering
2 aspects:
• filtering to improve quality of SNP calls
• filtering to find likely candidates
30
Reduce false positives without increasing false negatives:
• depth of coverage
• mapping quality
• SNP clusters
• allelic balance (diploid genome)
• number of reads with mq0
• consequence
Filtering to improve quality
31
java \
  -Xmx4g \
  -jar GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R human_b36_plus.fasta \
  -o output.vcf \
  -B variant,VCF,input.vcf \
  --clusterWindowSize 10 \
  --filterExpression 'DP < 3 || DP > 1200' \
  --filterName 'DP' \
  --filterExpression 'QUAL < 20' \
  --filterName 'QUAL' \
  --filterExpression 'AB > 0.75 && DP > 40' \
  --filterName 'AB'
GATK
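To see what the three filter expressions in the command above do to an individual call, their logic can be mimicked in Python. This is a sketch of the expressions only, not of GATK itself; `apply_filters` is a hypothetical helper, the SNP cluster filter is left out, and `ab` is optional because allele balance is only defined for heterozygous calls.

```python
def apply_filters(qual, dp, ab=None):
    """Return the names of the filters a call fails (empty list = passes)."""
    failed = []
    if dp < 3 or dp > 1200:                          # 'DP < 3 || DP > 1200'
        failed.append("DP")
    if qual < 20:                                    # 'QUAL < 20'
        failed.append("QUAL")
    if ab is not None and ab > 0.75 and dp > 40:     # 'AB > 0.75 && DP > 40'
        failed.append("AB")
    return failed

print(apply_filters(qual=36.0, dp=2))            # too shallow
print(apply_filters(qual=45.0, dp=6))            # passes
print(apply_filters(qual=50.0, dp=60, ab=0.9))   # skewed allele balance
```

A call that fails no filter gets an empty list here; in the filtered VCF the names of the failed filters end up in the FILTER column.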
32
VCF file
33
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam	
1 856182 rs9988021 G A 36.00 DP DB;DP=2;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 PASSED DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .
Transition/transversion ratio
Transition/transversion ratio Ti/Tv
random: Ti/Tv = 0.5
whole genome: Ti/Tv = 2.0-2.1
exome: Ti/Tv = 3-3.5
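Why random substitutions give Ti/Tv = 0.5 becomes clear when counting: of the 12 possible single-base substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions. A small sketch, with hypothetical helper names:

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """Transition = purine<->purine or pyrimidine<->pyrimidine."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES

def ti_tv(snps):
    """Ti/Tv ratio for a list of (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

# All 12 possible substitutions, each equally likely: the "random" case
all_substitutions = list(permutations("ACGT", 2))
print(ti_tv(all_substitutions))
```

Applied to the (REF, ALT) columns of a call set, the same function gives the ratio to compare against the expected values listed above.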
34
Novel SNPs
Number of novel SNPs
exome:
total = 20k - 25k
novel = 1k - 3k
35
Factors that influence SNP accuracy
• sequencing technology
• mapping algorithms and parameters
• post-mapping manipulation
duplicate removal, base quality recalibration, local realignment around
indels, ...
• SNP calling algorithms and parameters
36
Specificity vs sensitivity
37
true positives
false positives
Filtering to find likely candidates
Which are the most interesting?
• only high-quality calls: DP, QUAL, AB, but keep an eye on Ti/Tv
• novel
• loss-of-function (stop gained, splice site, ...) or predicted to be damaging (non-synonymous)
• found in multiple individuals
• conserved
• homozygous non-reference or compound heterozygous
38
Disease model
• dominant: a single heterozygous SNP is damaging
• recessive: either homozygous non-reference or compound heterozygous
necessary to lead to disease phenotype
(e.g. phenylketonuria: cannot convert phenylalanine to tyrosine. Can lead
to: mental retardation, microcephaly, ...)
39
B - Structural variation
40
Why bother?
Iafrate et al, Nat Genet 2004 & Sebat et al, Science 2004
Redon et al, Nature 2006: 12% of genome is covered by copy number variable
regions (270 individuals) => more nucleotide content per genome than SNPs
•colour vision in primates
•CCL3L1 copy number -> susceptibility to HIV
•AMY1 copy number -> diet
=> “the dynamic genome”
41
42
Case 1: Evolution - chromosome fusion
human chromosome 2
chimp chromosome 12
chimp chromosome 13
by Beth Kramer
43
Molecular Biology of the Cell, 4th Edition
colorectal cancer
karyotype
normal karyotype
44
Case 2: Cancer - rearranged genome
Robberecht et al, 2010
45
Case 3: Embryogenesis - “abnormal” cells
segmental chromosomal imbalances
mosaicism for whole chromosomes
uniparental isodisomy
46
Case 4: Down Syndrome = trisomy 21
Types of structural variation
Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009
47
Types of structural variation
48
Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009
CNV = Copy Number Variation
Copy number variation (CNV)
Not equally distributed over genome: more pericentromeric and subtelomeric
(especially in primates)
Pericentromeric & subtelomeric regions: bias towards interchromosomal
rearrangements; interstitial regions: bias towards intrachromosomal
Generation of duplications:
pericentromeric: 2-stage model (Sharp & Eichler, 2006)
1. series of seeding events: one or more progenitor loci transpose together
to pericentromeric receptor => generates mosaic block of duplicated
segments derived from different loci
2. inter- & intrachromosomal duplication => large blocks are duplicated
near other centromeres
subtelomeric: due to normal recombination: cross-overs lead to translocation
of distal sequences between chromosomes
49
Copy number variation and segmental duplications
Close relationship between CNVs and segmental duplications (aka low-copy repeats
aka LCRs; genomic regions with >1 copy that are at least 1kb long and have at least
90% sequence similarity):
• Copy number variation that is fixed in population = segmental duplication (in other
words: segmental duplications started out themselves as copy number variations)
• Segmental duplications can stimulate formation of new CNVs due to NAHR (see
later)
➡In human + chimp: 70-80% of inversions and 40% of insertions/deletions overlap
with segmental duplications
➡80% of human segmental duplications arose after the divergence of Great Apes from
the rest of the primates
50
Effects of structural variation
51
Feuk et al, 2006
Mechanisms of formation for structural variation
52
Gu et al, 2008
Mechanisms: NAHR
53
NAHR = non-allelic homologous
recombination
often between segmental
duplications
•can recur
•clustered breakpoints
•larger
Hastings et al, 2009
Mechanisms: NHEJ
54
Gu et al, 2008
NHEJ = non-homologous end-joining
pathway to repair double-strand breaks, but may lead to
translocations and telomere fusion
not associated with segmental duplications
• more scattered
• unique origins
• smaller
Mechanisms: FoSTeS
55
Hastings et al, 2009
FoSTeS = DNA replication fork-stalling and
template switching
can occur multiple times in series => can generate
very complex rearrangements
Feuk et al, 2006
Discovery of structural variation
56
Approaches for discovery
• karyotyping, fluorescent in situ hybridization (FISH)
• array comparative genomic hybridization (aCGH)
• next-generation sequencing: combination of:
• read pair information
• read depth information
• split read information
• for fine-mapping breakpoints: local assembly
=> identify signatures
57
Feuk et al, 2006
Feuk et al, 2006
Feuk et al, 2006
FISH = fluorescent in situ hybridization
duplication
inversion
duplication
Structural variation discovery using FISH
58
Structural variation discovery using aCGH
59
Xie & Tammi, 2009
aCGH = array comparative genomic hybridization
60
http://www.breenlab.org/array.html
61
van de Wiel et al, 2010
Structural variation discovery using next-generation
sequencing
General approaches:
1.Read depth
2.Read pairs
3.Split reads
62
Structural variation discovery: read depth
Xie & Tammi, 2009
63
Workflow
1.Mapping
2.Read filtering
3.GC correction
4.Spike identification
5.Validation
64
General principle
• Similar to aCGH: using reference RD file (e.g. from 1000Genomes Project)
• In theory: higher resolution, but noisier than aCGH
• Algorithms not mature yet
• More complex steps
➡Data binned
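The binning principle can be sketched as follows: count read start positions per fixed-size window in the test and the reference sample, scale for library size, and take the log2 ratio per window. This is a toy illustration with invented positions and hypothetical function names, not the method of any particular tool.

```python
import math
from collections import Counter

def bin_counts(read_starts, window):
    """Number of reads starting in each window (keyed by window index)."""
    return Counter(pos // window for pos in read_starts)

def log2_ratios(test_starts, ref_starts, window):
    """log2(test/ref) read depth per window, keyed by window start."""
    test = bin_counts(test_starts, window)
    ref = bin_counts(ref_starts, window)
    scale = len(ref_starts) / len(test_starts)   # correct for library size
    return {
        b * window: math.log2(test[b] * scale / ref[b])
        for b in sorted(ref)
        if test[b] > 0                           # skip windows without test reads
    }

# Toy data: the test sample carries twice as many reads in window 1000-1999
ref_reads = list(range(0, 3000, 10))
test_reads = ref_reads + list(range(1000, 2000, 10))
for start, ratio in log2_ratios(test_reads, ref_reads, window=1000).items():
    print(start, round(ratio, 3))
```

The duplicated window stands out with a log2 ratio exactly 1 higher than its neighbours. Smaller windows give higher resolution but noisier counts, which is the trade-off mentioned above.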
65
66
67
van de Wiel et al, 2010
Xie & Tammi, 2009
68
69
CNV = copy number variation
Combining CNV data for >1 individuals/samples
70
CNVR = copy number variation region
CNVR = any region covered by at least 1 CNV
71
CNVE = copy number variation event
CNVE = subgroups of CNVR with >= 50% reciprocal overlap
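The two definitions can be made concrete with interval arithmetic. A minimal sketch, assuming CNVs are given as (start, end) tuples on one chromosome with half-open coordinates; the function names and the example intervals are invented for illustration.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval that is covered by their overlap."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def merge_into_cnvrs(cnvs):
    """Union of overlapping CNVs: every region covered by at least 1 CNV."""
    regions = []
    for start, end in sorted(cnvs):
        if regions and start <= regions[-1][1]:
            regions[-1][1] = max(regions[-1][1], end)   # extend current region
        else:
            regions.append([start, end])                # start a new region
    return [tuple(r) for r in regions]

cnvs = [(100, 200), (120, 210), (190, 900), (2000, 2500)]
print(merge_into_cnvrs(cnvs))
print(reciprocal_overlap((100, 200), (120, 210)) >= 0.5)   # same CNVE?
print(reciprocal_overlap((120, 210), (190, 900)) >= 0.5)   # same CNVE?
```

In this toy example the first three CNVs merge into a single CNVR, but only the first two overlap reciprocally by at least 50%, so that CNVR would contain more than one CNVE.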
Data normalization
• Mainly: GC
• Other: repeat-rich regions, mapping Q, ...
• Fit linear model GC-content and RD => noise decreases
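A linear GC correction can be sketched in plain Python: fit read depth against GC content per window by least squares, then subtract the fitted trend while keeping the overall mean depth. The data below are invented and perfectly linear, purely to show the mechanics; real read depth is far noisier and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def gc_correct(gc, depth):
    """Remove the linear GC trend from per-window depth, keeping the mean."""
    slope, intercept = fit_line(gc, depth)
    mean_depth = sum(depth) / len(depth)
    return [d - (slope * g + intercept) + mean_depth for g, d in zip(gc, depth)]

gc    = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55]   # GC fraction per window (toy)
depth = [22.0, 25.0, 28.0, 31.0, 34.0, 37.0]   # read depth per window (toy)
print([round(d, 2) for d in gc_correct(gc, depth)])
```

Because the toy depth depends only on GC content, the corrected values all collapse onto the mean depth. With real data the residual variation is what remains to be segmented.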
72
Segmentation
• Identify spikes
• Many segmentation algorithms, e.g. GADA
• Issues: setting parameters: when to cut off peaks?
• Combine outputs from different runs with different parameters
• Compare to known CNVs
73
74
Xie & Tammi, 2009
75
Xie & Tammi, 2009
peak
76
Xie & Tammi, 2009
...but is this?
77
Abyzov et al
Drawbacks
• Can only find unbalanced structural variation (i.e. CNVs)
• How to assess specificity and sensitivity? => compare with known CNVs
• Database of Genomic Variants DGV (http://projects.tcag.ca/variation/)
• Decipher (http://decipher.sanger.ac.uk/)
• Breakpoints: unknown
• Different parameters for rare vs common CNVs => which?
78
Structural variation discovery: read pairs
79
Korbel et al, 2007
Discordant readpairs
• Orientation
• Distance
• Plot insert size distribution for chromosome
• Very long tail!! => difficult to set cutoff: 4 MAD or 0.01%?
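One of the cutoffs mentioned above, median +/- 4 MAD (median absolute deviation), can be sketched as follows. The insert sizes are invented and the function names are hypothetical; the point is that the median and MAD are barely affected by the long tail, unlike mean and standard deviation.

```python
from statistics import median

def insert_size_cutoffs(insert_sizes, n_mad=4):
    """Lower and upper cutoff at median -/+ n_mad * MAD."""
    med = median(insert_sizes)
    mad = median(abs(s - med) for s in insert_sizes)
    return med - n_mad * mad, med + n_mad * mad

def is_discordant(insert_size, cutoffs):
    """A pair is discordant in distance if it falls outside the cutoffs."""
    low, high = cutoffs
    return insert_size < low or insert_size > high

sizes = [190, 195, 198, 200, 200, 202, 205, 210, 650, 3000]   # toy library
cutoffs = insert_size_cutoffs(sizes)
print(cutoffs)
print([s for s in sizes if is_discordant(s, cutoffs)])
```

Here the median is 201 and the MAD is 5, so pairs outside 181-221 are flagged: the two with insert sizes 650 and 3000. Orientation would need a separate check.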
80
Read pair signatures
Medvedev et al, 2009
81
Real data
82
Read pair workflow
1. Map reads
2. Identify discordant pairs
3. Cluster on location
4. Filter on number of readpairs per cluster
5. Filter on read depth
6. Filter on mapping quality for read pairs
7. Identify signatures
8. (Optionally) create alternative reference
9. Validate
83
84
figure by Klaudia Walter
85
figure by Klaudia Walter
86
figure by Klaudia Walter
87
figure by Klaudia Walter
88
figure by Klaudia Walter
Clustering
• “standard clustering strategy”
• only consider mate pairs that do not have concordant mappings
• ignore read pairs that have more than one good mapping
• clustering: use insert size distribution (e.g. 2x4 MAD)
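A very simplified version of clustering on location, followed by the filter on the number of read pairs per cluster from the workflow, could look like this. It is a sketch under strong assumptions: pairs are (left position, right position) tuples on one chromosome, only the left position is used for grouping, and `max_gap` stands in for a distance derived from the insert size distribution. Real tools also check the right-hand positions and the orientation.

```python
def cluster_pairs(pairs, max_gap):
    """Group discordant pairs whose left-read positions lie within max_gap."""
    clusters = []
    for left, right in sorted(pairs):
        if clusters and left - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append((left, right))    # close to the previous pair
        else:
            clusters.append([(left, right)])      # start a new cluster
    return clusters

def filter_clusters(clusters, min_pairs=2):
    """Keep only clusters supported by at least min_pairs read pairs."""
    return [c for c in clusters if len(c) >= min_pairs]

# Toy discordant pairs: three support the same event, one stands alone
pairs = [(1000, 6000), (1040, 6050), (1100, 6020), (9000, 15000)]
clusters = cluster_pairs(pairs, max_gap=200)
print(filter_clusters(clusters))
```

The isolated pair at position 9000 forms its own cluster and is dropped by the filter, while the three pairs near position 1000 survive as one candidate event.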
89
Clustering: issues
• Ignores pairs that have >1 good mapping => no detection within repetitive
regions (segmental duplications)
• What cutoff for what is considered abnormal distance? (4 MAD? 0.01%?
2stdev?)
• Low library quality or mix of libraries => multiple peaks in size distribution
90
Filtering
• On number of RPs per cluster
• normally: n = 2
• for high coverage (e.g. 1000Genomes pilot 2: 80X): n = 5
• On drop in read depth and split reads
• On (mappingQ x nrRP)
• if published data available: look at specificity and sensitivity for different cutoffs of mQ x nrRP
• if not: very difficult
91
Filtering: issues
• Large insert size: low resolution for detecting breakpoints
• Small insert size: low resolution for detecting complex regions
92
Structural variation discovery: split reads
93
Mapping
• short subsequences => many possible mappings
• solution: “anchored split mapping” (e.g. Pindel)
94
Medvedev et al, 2009
Local reassembly
• Aim: to determine breakpoints
• Which reads?
• for deletions: local reads
• for insertions: hanging reads for read pairs with only one read mapped
• (rather not: unmapped reads)
• For large region: split up
95
96
97
Nielsen et al, 2009
sequence reads -> contigs (using sequence overlap)
contigs -> scaffolds (using read-pair information)
1 scaffold contigs
98
General conclusions NGS & structural variation (1)
              +                                  -
read depth    conceptually simple                only unbalanced (CNVs); low resolution
read pairs    wide range of types of variation   complicated
split reads   basepair resolution                very small reads
General conclusions NGS & structural variation (2)
• Available algorithms: more to demonstrate technique than comprehensive
solution
• Difficult => different software = different results => “consensus set”
• based on read pairs and split reads: many sets agree
• based on read depth: totally different
• sometimes drop in read depth, but no aberrant read pairs spanning the region
=> why???
• Mapper = critical; maq/bwa: only 1 mapping (=> many false negatives); mosaik,
mrFAST: return more results
99
Software for structural variation discovery
100
Medvedev et al, 2009
Chris Yoon
101
Chris Yoon
102
103
Websites
http://www.broadinstitute.org/gatk
http://samtools.sourceforge.net
http://picard.sourceforge.net
http://www.annotate-it.org
http://bit.ly/siftsnp
References and software
• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
• Lee S et al. Bioinformatics 24:i59-i67 (2008)
• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
• Campbell P et al. Nat Genet 40:722-729 (2008)
• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
• Chen K et al. Genome Res 19:1527-1741 (2009)
• Yoon S et al. Genome Res 19:1586-1592 (2009)
• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)
• Hastings P et al. Nat Rev Genet 10:551-564 (2009)
104
Exercises
105
Finding SNPs using Galaxy
Based on the SAM-file you created in Galaxy in the last lecture, create a list of
SNPs. You’ll first have to convert the SAM file to BAM, then create a pileup and
finally filter the pileup (using “Filter pileup on coverage and SNPs”). Let this filter
only return variants where the coverage is larger than 3 and the base quality is
larger than 20.
How many SNPs do you find?
Calculate a histogram of the coverage over all SNPs (= column 4 in the filtered
file you just created)
106
Finding SNPs using samtools
Using the SAM file you created in the last lecture on the linux command line:
Generate a BAM file and sort it. Next, generate a pileup for that BAM file using
~jaerts/i0d51a/chr9.fa as the reference sequence. When doing this: only print
the variant sites and also compute the reference sequence (run “samtools
pileup” without arguments to get more info).
How many SNPs are identified? Is the SNP at position 139,391,636
heterozygous or homozygous-non-reference? And the one at 139,399,365? Do
you trust the SNP at 139,401,304?
107
Annotating and filtering SNPs
Download ~jaerts/i0d51a/sift.input to your own machine and then upload it to
the SIFT website at http://bit.ly/siftsnp. Positions in this file are on Homo
sapiens build NCBI36. Make sure to let SIFT send the results by email.
How many SNPs are in/near genes?
How many are in exons?
What percentage of the SNPs is predicted damaging?
108
Structural variation
We’ll be looking at copy number variation using the cnv-seq package. This
software is available from http://tiger.dbs.nus.edu.sg/cnv-seq/
We’ll be running the example from the cnv-seq tutorial at
http://tiger.dbs.nus.edu.sg/cnv-seq/doc/manual.pdf. (Read that!)
•Log into the server mentioned on Toledo.
•Calculate CNVs in the file ~jaerts/i0d51a/test_1.hits compared to ~jaerts/i0d51a/ref_1.hits:
/mnt/apps/cnv-seq/current/cnv-seq.pl --test ~jaerts/i0d51a/test_1.hits --ref ~jaerts/i0d51a/ref_1.hits \
  --genome chrom1 --log2 0.6 -p 0.001 --bigger-window 1.5 --annotate --minimum-windows 4
•Finally investigate in R. Start R by typing “R”. Then:
library(cnv)
data <- read.delim('test_1.hits-vs-ref_1.hits.log2-0.6.pvalue-0.001.miw-4.cnv')
cnv.print(data)
cnv.summary(data)
plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6)
ggsave('sample_1.pdf')
•Describe the main features in the plot.
109

More Related Content

Viewers also liked

Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotations
Thomas Keane
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data
Surya Saha
 
NGS Data Preprocessing
NGS Data PreprocessingNGS Data Preprocessing
NGS Data Preprocessing
cursoNGS
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
Maté Ongenaert
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Thomas Keane
 
NGS - QC & Dataformat
NGS - QC & Dataformat NGS - QC & Dataformat
NGS - QC & Dataformat
Karan Veer Singh
 
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent DataIonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
Adrian Baez-Ortega
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1
Maté Ongenaert
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
rjorton
 
Case studies of HTS / NGS applications
Case studies of HTS / NGS applicationsCase studies of HTS / NGS applications
Case studies of HTS / NGS applications
rjorton
 

Viewers also liked (11)

Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotations
 
Quality Control of NGS Data
Quality Control of NGS Data Quality Control of NGS Data
Quality Control of NGS Data
 
NGS Data Preprocessing
NGS Data PreprocessingNGS Data Preprocessing
NGS Data Preprocessing
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1
 
NGS - QC & Dataformat
NGS - QC & Dataformat NGS - QC & Dataformat
NGS - QC & Dataformat
 
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent DataIonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
 
Workshop NGS data analysis - 1
Workshop NGS data analysis - 1Workshop NGS data analysis - 1
Workshop NGS data analysis - 1
 
NGS data formats and analyses
NGS data formats and analysesNGS data formats and analyses
NGS data formats and analyses
 
Ngs ppt
Ngs pptNgs ppt
Ngs ppt
 
Case studies of HTS / NGS applications
Case studies of HTS / NGS applicationsCase studies of HTS / NGS applications
Case studies of HTS / NGS applications
 

Similar to Next-generation sequencing - variation discovery

ECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPsECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPsJan Aerts
 
Comparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerlComparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerl
Jason Stajich
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
Li Shen
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
cursoNGS
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
Bioinformatics and Computational Biosciences Branch
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
HAMNAHAMNA8
 
Modware
ModwareModware
Modware
bosc
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
Li Shen
 
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
James Nelson
 
Benchmarking Perl (Chicago UniForum 2006)
Benchmarking Perl (Chicago UniForum 2006)Benchmarking Perl (Chicago UniForum 2006)
Benchmarking Perl (Chicago UniForum 2006)brian d foy
 
Whole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisWhole Genome Sequencing Analysis
Whole Genome Sequencing Analysis
Efi Athieniti
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic Biology
Uri Laserson
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
QIAGEN
 
Split-plot Designs
Split-plot DesignsSplit-plot Designs
Split-plot Designs
richardchandler
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
Yanchang Zhao
 
Redis 101
Redis 101Redis 101
Redis 101
Doğan Can
 
OpenCL applications in genomics
OpenCL applications in genomicsOpenCL applications in genomics
OpenCL applications in genomicsUSC
 

Similar to Next-generation sequencing - variation discovery (20)

ECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPsECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPs
 
Comparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerlComparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerl
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
 
UniView
UniViewUniView
UniView
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
 
M Sc Project
M Sc ProjectM Sc Project
M Sc Project
 
Modware
ModwareModware
Modware
 
Next-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plotNext-generation sequencing format and visualization with ngs.plot
Next-generation sequencing format and visualization with ngs.plot
 
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
IU Applied Machine Learning Class Final Project: ML Methods for Predicting Wi...
 
Benchmarking Perl (Chicago UniForum 2006)
Benchmarking Perl (Chicago UniForum 2006)Benchmarking Perl (Chicago UniForum 2006)
Benchmarking Perl (Chicago UniForum 2006)
 
Whole Genome Sequencing Analysis
Whole Genome Sequencing AnalysisWhole Genome Sequencing Analysis
Whole Genome Sequencing Analysis
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic Biology
 
R
RR
R
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 
Split-plot Designs
Split-plot DesignsSplit-plot Designs
Split-plot Designs
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
Redis 101
Redis 101Redis 101
Redis 101
 
OpenCL applications in genomics
OpenCL applications in genomicsOpenCL applications in genomics
OpenCL applications in genomics
 

Next-generation sequencing - variation discovery

  • 1. [I0D51A] Bioinformatics: High-Throughput Analysis. Next-generation sequencing. Part 3: Variation discovery. Prof Jan Aerts, Faculty of Engineering - ESAT/SCD, jan.aerts@esat.kuleuven.be. TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)
  • 3. Types of genomic variation: SNPs vs structural variation
  • 4. A - Single nucleotide polymorphisms (SNPs)
  • 5. What are SNPs and why are they important?
      • SNP = single nucleotide polymorphism
      • It’s the differences that matter:
          • Human vs chimp: 98% identical (<2 differences every 100bp)
          • Between any 2 individuals: 1 difference every 1000bp
          • Disease: A or G == life or death
      • Mutations can result in:
          • change in level of transcription or translation (loss/gain)
          • change in protein structure
  • 7. SNP discovery - overview: generate sequence reads ➡ map reads to reference sequence ➡ convert from read-based to position-based (“pileup”) ➡ identify differences
  • 12. Not a trivial problem... (image: Monet, “Meule, Effet de Neige, le Matin”)
  • 13. Many SNP callers:
      • samtools
      • GATK
      • SOAPsnp
      • ...
      Read-based -> position-based. Here: (1) samtools -> pileup; (2) GATK -> VCF
  • 16. pileup (labels on the slide: alignment, mapping quality)
      1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
      1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
      1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
      1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
      1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
      1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
      1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
      1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
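As a rough illustration of how the read-bases column of a pileup line can be turned into per-allele counts, here is a minimal Python sketch. The function name `count_pileup_bases` is hypothetical, and the parsing rules assumed are the usual pileup conventions: `.` and `,` match the reference on the forward and reverse strand, a letter is a mismatch, `^` plus one character marks a read start with its mapping quality, `$` marks a read end, and `+n`/`-n` introduce indels.

```python
import re
from collections import Counter

def count_pileup_bases(ref, bases):
    """Count alleles in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Read start: the next character encodes mapping quality, skip both.
            i += 2
        elif ch == "$":
            # Read end marker carries no base.
            i += 1
        elif ch in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            m = re.match(r"\d+", bases[i + 1:])
            if m is None:
                i += 1
            else:
                i += 1 + len(m.group()) + int(m.group())
        else:
            if ch in ".,":
                counts[ref.upper()] += 1   # matches the reference
            else:
                counts[ch.upper()] += 1    # mismatch (or '*' deletion placeholder)
            i += 1
    return counts

# Position 277 from the pileup above: reference T, mixed with C and G.
print(count_pileup_bases("T", "..CCggC,C,.C.,,CC,..g."))
```

Strand information is discarded here by upper-casing; a real caller would normally keep it, since strand bias is a useful quality signal.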
  • 17. Intermezzo: quality scores. “Phred-score”: used for sequence quality as well as mapping quality; score = -10 * log10(error probability).
      • Chance of 1/1000 that read is mapped at wrong position = 10^-3 => phred-score = 30
      • Chance of 1/100 that read is mapped at wrong position = 10^-2 => phred-score = 20
      • Sanger encoding (character = ASCII code of score + 33): quality score 30 = “?”
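The relation between error probability, Phred score and the Sanger character can be sketched in a few lines of Python. The function names are illustrative only; the offset of 33 is the standard Sanger (Phred+33) convention.

```python
import math

SANGER_OFFSET = 33  # Sanger / Phred+33: character code = score + 33

def phred_from_prob(p_error):
    """Phred score for a given error probability."""
    return -10.0 * math.log10(p_error)

def prob_from_phred(q):
    """Error probability for a given Phred score."""
    return 10.0 ** (-q / 10.0)

def sanger_char(q):
    """Character that encodes an integer Phred score in Sanger encoding."""
    return chr(q + SANGER_OFFSET)

def sanger_score(ch):
    """Phred score encoded by one Sanger quality character."""
    return ord(ch) - SANGER_OFFSET

print(round(phred_from_prob(1 / 1000)))  # error chance 1/1000
print(sanger_score("<"))                 # the most common character in the pileup example
```

Decoding the `<` that dominates the quality column of the pileup example gives a score of 27, i.e. an error probability of roughly 0.002.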
  • 19. Heterozygous SNPs and the binomial distribution
      • SNPs are bi-allelic => allele combinations for a heterozygous SNP follow the binomial distribution
          • outcome = binary (red/white, head/tail, yes/no, A/G)
          • probability p of the outcome of a single draw is the same for all draws
      • E.g. 8 A’s + 12 G’s = SNP?
          • hypothesis: heterozygous => nr of draws = 20; nr of “successes” = 8; probability p of outcome in single draw = 0.5
          • table with cumulative binomial probabilities: http://bit.ly/cumul_binom_prob
          • 8 A’s given coverage of 20 => cumulative probability = 0.252 > 0.05 => the heterozygous hypothesis is not rejected => heterozygote
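The lookup in the cumulative binomial table can be reproduced directly. This is a minimal sketch with hypothetical function names; the 0.05 threshold and p = 0.5 come from the slide, and applying the test to the rarer of the two alleles is an assumption made here so that the result does not depend on which allele is called the "success".

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def consistent_with_het(allele_count, depth, alpha=0.05):
    """True if the allele counts do not reject the heterozygous hypothesis."""
    minor = min(allele_count, depth - allele_count)
    return binom_cdf(minor, depth) > alpha

print(round(binom_cdf(8, 20), 3))   # 8 A's out of 20 reads
print(consistent_with_het(8, 20))
print(consistent_with_het(1, 20))   # 1 A out of 20 reads looks like an error instead
```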
  • 21. samtools: samtools pileup -vcs -r 0.001 -l CCDS.txt -f human_b36_plus.fasta input.bam output.pileup
  • 22. VCF file
      ##fileformat=VCFv3.3
      ##FILTER=DP,"DP < 3 || DP > 1200"
      ##FILTER=QUAL,"QUAL < 25.0"
      ##FILTER=SnpCluster,"SNPs found in clusters"
      ##FORMAT=DP,1,Integer,"Read Depth"
      ##FORMAT=GQ,1,Integer,"Genotype Quality"
      ##FORMAT=GT,1,String,"Genotype"
      ##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
      ##INFO=DB,0,Flag,"dbSNP Membership"
      ##INFO=DP,1,Integer,"Total Depth"
      ##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
      ##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
      ##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
      ##INFO=MQ,1,Float,"RMS Mapping Quality"
      ##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
      ##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
      ##annotatorReference=human_b36_plus.fasta
      ##reference=human_b36_plus.fasta
      ##source=VariantAnnotator
      ##source=VariantFiltration
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
      1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
      1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
      . . .
  • 23. VCF file (same file as slide 22, annotated): the lines starting with "##" = file header; the line starting with "#CHROM" = column header; the remaining lines = actual data
  • 24. VCF file
      INFO column:
          DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
          DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE
      FORMAT column and sample column (a_a:bwa057_b:picard.bam):
          GT:DP:GQ 1/1:3:36.00
          GT:DP:GQ 1/1:6:45.00
      GT:DP:GQ = genotype : depth : genotype quality
      1/1 = homozygous non-reference; 0/1 = heterozygous
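To make the relation between the INFO, FORMAT and sample columns concrete, here is a small Python sketch that splits them into dictionaries. The function names are hypothetical and values are kept as strings; a real parser would use the types declared in the file header.

```python
def parse_info(info):
    """Split an INFO string into a dict; flags without '=' map to True."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out

def parse_sample(format_field, sample_field):
    """Pair the keys in FORMAT with the values in a sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))

def zygosity(gt):
    """Describe a diploid genotype string such as '0/1' or '1/1'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous non-reference"

info = parse_info("DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE")
sample = parse_sample("GT:DP:GQ", "1/1:3:36.00")
print(info["DB"], info["DP"], sample["GT"], zygosity(sample["GT"]))
```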
  • 25. GATK: java -Xmx6g -jar /path_to/GenomeAnalysisTK.jar -l INFO -R human_b36_plus.fasta -I input.bam -T UnifiedGenotyper --heterozygosity 0.001 -pl Solexa -varout output.vcf -vf VCF -mbq 20 -mmq 10 -stand_call_conf 30.0 --DBSNP dbsnp_129_b36_plus.rod
  • 27. We have: chromosome + position + alleles. We need:
      • in gene?
      • damaging?
      These will be the basis for filtering. Tools: SIFT (http://sift.bii.a-star.edu/sg), annovar, PolyPhen, ...
  • 28. SIFT
      input:
          3,81780820,1,T/C
          2,43881517,1,A/T
          2,43857514,1,T/C
      output:
          #SNP              codon    substitution  region    type           prediction  gene      OMIM
          3,81780820,1,T/C  AGA-gGA  R190G         EXON CDS  Nonsynonymous  DAMAGING    GBE1      POLYGLUCOSAN BODY DISEASE
          2,43881517,1,A/T  ATA-tTA  I230L         EXON CDS  Nonsynonymous  TOLERATED   DYNC2LI1
          2,43857514,1,T/C  TTT-TcT  F33S          EXON CDS  Nonsynonymous  TOLERATED   DYNC2LI1
  • 30. SNP filtering. 2 aspects:
      • filtering to improve quality of SNP calls
      • filtering to find likely candidates
  • 31. Filtering to improve quality. Reduce false positives without increasing false negatives:
      • depth of coverage
      • mapping quality
      • SNP clusters
      • allelic balance (diploid genome)
      • number of reads with mq0
      • consequence
  • 32. GATK: java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantFiltration -R human_b36_plus.fasta -o output.vcf -B variant,VCF,input.vcf --clusterWindowSize 10 --filterExpression 'DP < 3 || DP > 1200' --filterName 'DP' --filterExpression 'QUAL < 20' --filterName 'QUAL' --filterExpression 'AB > 0.75 && DP > 40' --filterName 'AB'
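The three filter expressions in the command above can be mirrored in plain Python to see what they do to a call. This is only a sketch of the logic, not GATK's own implementation; the dictionary layout of a variant and the function name `apply_filters` are assumptions.

```python
# Each entry: (filter name, predicate that is True when the variant FAILS the filter).
FILTERS = [
    ("DP", lambda v: v["DP"] < 3 or v["DP"] > 1200),
    ("QUAL", lambda v: v["QUAL"] < 20),
    # AB (allele balance) is only annotated for heterozygous calls, so it may be absent.
    ("AB", lambda v: v.get("AB") is not None and v["AB"] > 0.75 and v["DP"] > 40),
]

def apply_filters(variant):
    """Return the names of all filters the variant fails (empty list = passes)."""
    return [name for name, fails in FILTERS if fails(variant)]

print(apply_filters({"DP": 2, "QUAL": 36.0}))              # too shallow
print(apply_filters({"DP": 6, "QUAL": 45.0}))              # passes
print(apply_filters({"DP": 50, "QUAL": 15.0, "AB": 0.9}))  # low quality, skewed balance
```

Keeping the failed filter names, rather than dropping the record, matches the way the FILTER column of a VCF file is used: the call stays in the file and can be re-examined later.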
  • 33. VCF file after filtering (file header identical to slide 22; the FILTER column now holds the name of the failed filter, or PASSED)
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
      1 856182 rs9988021 G A 36.00 DP DB;DP=2;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
      1 866362 rs4372192 A G 45.00 PASSED DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
      . . .
  • 34. Transition/transversion ratio (Ti/Tv)
      • random: Ti/Tv = 0.5 (of the 12 possible base substitutions, 4 are transitions and 8 are transversions)
      • whole genome: Ti/Tv = 2.0-2.1
      • exome: Ti/Tv = 3-3.5
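A Ti/Tv ratio can be computed from a list of (ref, alt) pairs in a few lines. The sketch below uses hypothetical function names; a transition is a purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) change, everything else is a transversion.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)

def titv_ratio(snps):
    """Ti/Tv ratio for an iterable of (ref, alt) single-base substitutions."""
    snps = [(r.upper(), a.upper()) for r, a in snps if r.upper() != a.upper()]
    ti = sum(1 for r, a in snps if is_transition(r, a))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

# All 12 possible substitutions, each once: the "random" expectation.
print(titv_ratio(permutations("ACGT", 2)))
# The two calls from the VCF example (G->A and A->G) are both transitions.
print(is_transition("G", "A"), is_transition("A", "G"))
```

A call set whose ratio drifts from the expected genome or exome value towards 0.5 is a hint that random errors are being called as SNPs.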
  • 35. Novel SNPs. Number of novel SNPs in an exome: total = 20k - 25k; novel = 1k - 3k
  • 36. Factors that influence SNP accuracy
      • sequencing technology
      • mapping algorithms and parameters
      • post-mapping manipulation: duplicate removal, base quality recalibration, local realignment around indels, ...
      • SNP calling algorithms and parameters
  • 38. Filtering to find likely candidates. Which are the most interesting?
      • only high quality: DP, QUAL, AB, but keep an eye on Ti/Tv
      • novel
      • loss-of-function (stop gained, splice site, ...) or predicted to be damaging (non-synonymous)
      • found in multiple individuals
      • conserved
      • homozygous non-reference or compound heterozygous
  • 39. Disease model
      • dominant: a single heterozygous SNP is damaging
      • recessive: either homozygous non-reference or compound heterozygous is necessary to lead to the disease phenotype (e.g. phenylketonuria: cannot convert phenylalanine to tyrosine. Can lead to: mental retardation, microcephaly, ...)
  • 40. B - Structural variation
  • 41. Why bother?
      • Iafrate et al, Nat Genet 2004 & Sebat et al, Science 2004
      • Redon et al, Nature 2006: 12% of genome is covered by copy number variable regions (270 individuals) => more nucleotide content per genome than SNPs
      • colour vision in primates
      • CCL3L1 copy number -> susceptibility to HIV
      • AMY1 copy number -> diet
      => “the dynamic genome”
  • 42. Case 1: Evolution - chromosome fusion
  • 43. human chromosome 2; chimp chromosome 12; chimp chromosome 13 (by Beth Kramer)
  • 44. Case 2: Cancer - rearranged genome. Colorectal cancer karyotype vs normal karyotype (Molecular Biology of the Cell, 4th Edition)
  • 45. Case 3: Embryogenesis - “abnormal” cells: segmental chromosomal imbalances; mosaicism for whole chromosomes; uniparental isodisomy (Robberecht et al, 2010)
  • 46. Case 4: Down Syndrome = trisomy 21
  • 47. Types of structural variation (Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009)
  • 48. Types of structural variation (Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009). CNV = Copy Number Variation
  • 49. Copy number variation (CNV)
      • Not equally distributed over genome: more pericentromeric and subtelomeric (especially in primates)
      • Pericentromeric & subtelomeric regions: bias towards interchromosomal rearrangements; interstitial regions: bias towards intrachromosomal
      • Generation of duplications:
          • pericentromeric: 2-stage model (Sharp & Eichler, 2006)
              1. series of seeding events: one or more progenitor loci transpose together to a pericentromeric receptor => generates mosaic block of duplicated segments derived from different loci
              2. inter- & intrachromosomal duplication => large blocks are duplicated near other centromeres
          • subtelomeric: due to normal recombination: cross-overs lead to translocation of distal sequences between chromosomes
  • 50. Copy number variation and segmental duplications. Close relationship between CNVs and segmental duplications (aka low-copy repeats aka LCRs; genomic regions with >1 copy that are at least 1kb long and have at least 90% sequence similarity):
      • Copy number variation that is fixed in the population = segmental duplication (in other words: segmental duplications themselves started out as copy number variations)
      • Segmental duplications can stimulate formation of new CNVs due to NAHR (see later)
      ➡ In human + chimp: 70-80% of inversions and 40% of insertions/deletions overlap with segmental duplications
      ➡ 80% of human segmental duplications arose after the divergence of Great Apes from the rest of the primates
  • 51. Effects of structural variation (Feuk et al, 2006)
  • 52. Mechanisms of formation for structural variation (Gu et al, 2008)
  • 53. Mechanisms: NAHR NAHR = non-allelic homologous recombination; often between segmental duplications • can recur • clustered breakpoints • larger (Hastings et al, 2009)
  • 54. Mechanisms: NHEJ NHEJ = non-homologous end-joining: pathway to repair double-strand breaks, but may lead to translocations and telomere fusion; not associated with segmental duplications • more scattered • unique origins • smaller (Gu et al, 2008)
  • 55. Mechanisms: FoSTeS FoSTeS = DNA replication fork-stalling and template switching; can occur multiple times in series => can generate very complex rearrangements (Hastings et al, 2009)
  • 56. Discovery of structural variation (Feuk et al, 2006)
  • 57. Approaches for discovery • karyotyping, fluorescence in situ hybridization (FISH) • array comparative genomic hybridization (aCGH) • next-generation sequencing: combination of: • read pair information • read depth information • split read information • for fine-mapping breakpoints: local assembly => identify signatures
  • 58. Structural variation discovery using FISH FISH = fluorescence in situ hybridization. Figure panels: duplication, inversion, duplication (Feuk et al, 2006)
  • 59. Structural variation discovery using aCGH aCGH = array comparative genomic hybridization (Xie & Tammi, 2009)
  • 61. Figure (van de Wiel et al, 2010)
  • 62. Structural variation discovery using next-generation sequencing General approaches: 1. Read depth 2. Read pairs 3. Split reads
  • 63. Structural variation discovery: read depth (Xie & Tammi, 2009)
  • 65. General principle • Similar to aCGH: using a reference read depth (RD) file (e.g. from the 1000 Genomes Project) • In theory: higher resolution, but noisier than aCGH • Algorithms not mature yet • More complex steps ➡ Data binned
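The binning step can be illustrated with a minimal Python sketch. It assumes reads are given as plain lists of 0-based mapped start positions on one chromosome; the function names `bin_counts` and `log2_ratios` and the library-size scaling are illustrative choices, not taken from any specific tool.

```python
import math
from collections import Counter

def bin_counts(positions, window):
    """Count read start positions per fixed-size window (bin index = pos // window)."""
    return Counter(pos // window for pos in positions)

def log2_ratios(test_pos, ref_pos, window):
    """Per-window log2(test/reference) read-depth ratio.

    The ratio is scaled by the total read counts so that two samples sequenced
    to different depths are comparable. Windows with no reads in either sample
    are skipped (assumption: they carry no usable signal).
    """
    test = bin_counts(test_pos, window)
    ref = bin_counts(ref_pos, window)
    scale = len(ref_pos) / len(test_pos)
    return {b: math.log2(test[b] / ref[b] * scale)
            for b in sorted(ref) if test[b] > 0 and ref[b] > 0}
```

A ratio near 0 suggests equal copy number in that window; clearly positive or negative values are candidate gains or losses, subject to the noise issues listed on the slide.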
  • 67. Figure (van de Wiel et al, 2010)
  • 68. Figure (Xie & Tammi, 2009)
  • 69. Combining CNV data for >1 individuals/samples CNV = copy number variation
  • 70. CNVR = copy number variation region: any region covered by at least 1 CNV
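As a rough illustration of this definition, CNVRs can be obtained by merging overlapping CNV calls from all samples. The sketch below assumes calls on a single chromosome given as `(start, end)` tuples with inclusive coordinates; `merge_cnvs` is a hypothetical helper name.

```python
def merge_cnvs(cnvs):
    """Merge overlapping (start, end) CNV calls into CNV regions (CNVRs)."""
    regions = []
    for start, end in sorted(cnvs):
        if regions and start <= regions[-1][1]:
            # Overlaps the current region: extend it if this call reaches further.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            # No overlap with the previous region: start a new CNVR.
            regions.append([start, end])
    return [tuple(r) for r in regions]
```

For example, calls at 100-200 and 150-300 in two samples collapse into one CNVR spanning 100-300.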
  • 71. CNVE = copy number variation event: subgroups of a CNVR with >= 50% reciprocal overlap
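Reciprocal overlap means the shared stretch must cover the required fraction of both intervals, not just one. A small sketch, assuming half-open `(start, end)` intervals on the same chromosome and a hypothetical function name:

```python
def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions for intervals a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    # Fraction of a covered by b, and fraction of b covered by a; keep the minimum.
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))
```

Two CNVs would be grouped into the same CNVE when `reciprocal_overlap(a, b) >= 0.5`. A small CNV nested in a much larger one fails this test, because the overlap covers only a small fraction of the large call.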
  • 72. Data normalization • Mainly: GC content • Other: repeat-rich regions, mapping quality, ... • Fit a linear model of read depth (RD) against GC content => noise decreases
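One way to picture the GC correction is an ordinary least-squares fit of per-window read depth against per-window GC fraction, followed by removal of the fitted trend. This is a simplified sketch under that assumption; real tools may use more robust or non-linear fits, and the names `fit_line` and `gc_correct` are illustrative.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gc_correct(gc, depth):
    """Subtract the GC trend from each window's depth, keeping the overall mean."""
    slope, intercept = fit_line(gc, depth)
    mean_depth = sum(depth) / len(depth)
    return [d - (slope * g + intercept) + mean_depth for g, d in zip(gc, depth)]
```

After correction, windows that differ only in GC content end up with similar depth, so the remaining variation is more likely to reflect copy number.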
  • 73. Segmentation • Identify spikes • Many segmentation algorithms, e.g. GADA • Issues: setting parameters: when to cut off peaks? • Combine outputs from different runs with different parameters • Compare to known CNVs
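To make the parameter problem concrete, here is a deliberately naive segmentation sketch: it reports runs of consecutive windows whose log2 ratio exceeds a threshold. It is not GADA or any published algorithm; the function name and the default values for `threshold` and `min_windows` are assumptions chosen for illustration.

```python
def call_segments(log2, threshold=0.6, min_windows=4):
    """Return (start_bin, end_bin, 'gain' or 'loss') runs; end_bin is exclusive."""
    def state(value):
        if value >= threshold:
            return "gain"
        if value <= -threshold:
            return "loss"
        return None

    segments, start = [], 0
    for i in range(1, len(log2) + 1):
        # Close the current run at the end of the data or when the state changes.
        if i == len(log2) or state(log2[i]) != state(log2[start]):
            label = state(log2[start])
            if label and i - start >= min_windows:
                segments.append((start, i, label))
            start = i
    return segments
```

Changing `threshold` or `min_windows` changes which peaks are reported, which is exactly the cut-off issue raised on the slide.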
  • 75. Figure: read-depth peak (Xie & Tammi, 2009)
  • 76. Figure: ...but is this a peak? (Xie & Tammi, 2009)
  • 78. Drawbacks • Can only find unbalanced structural variation (i.e. CNVs) • How to assess specificity and sensitivity? => compare with known CNVs • Database of Genomic Variants, DGV (http://projects.tcag.ca/variation/) • Decipher (http://decipher.sanger.ac.uk/) • Breakpoints: unknown • Different parameters for rare vs common CNVs => which?
  • 79. Structural variation discovery: read pairs (Korbel et al, 2007)
  • 80. Discordant read pairs • Orientation • Distance • Plot insert size distribution for chromosome • Very long tail!! => difficult to set cutoff: 4 MAD or 0.01%?
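The "4 MAD" rule can be sketched as follows: compute the median insert size and the median absolute deviation (MAD), then flag pairs whose insert size lies more than a chosen number of MADs from the median. The function name and the example insert sizes are made up for illustration.

```python
from statistics import median

def insert_size_cutoffs(sizes, n_mad=4):
    """Return (lower, upper) insert-size cutoffs at median +/- n_mad * MAD."""
    med = median(sizes)
    mad = median([abs(s - med) for s in sizes])
    return med - n_mad * mad, med + n_mad * mad

# Toy insert sizes: a tight library around 200 bp plus one far outlier.
sizes = [190, 195, 200, 205, 210, 220, 5000]
lower, upper = insert_size_cutoffs(sizes)
discordant = [s for s in sizes if s < lower or s > upper]
```

Median and MAD are used instead of mean and standard deviation because the long tail of the distribution would otherwise inflate the cutoff.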
  • 83. Read pair workflow 1. Map reads 2. Identify discordant pairs 3. Cluster on location 4. Filter on number of read pairs per cluster 5. Filter on read depth 6. Filter on mapping quality for read pairs 7. Identify signatures 8. (Optionally) create alternative reference 9. Validate
  • 89. Clustering • “standard clustering strategy” • only consider mate pairs that do not have concordant mappings • ignore read pairs that have more than one good mapping • clustering: use insert size distribution (e.g. 2x4 MAD)
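A minimal sketch of clustering discordant pairs on location, assuming each pair is reduced to `(left_pos, right_pos)` on one chromosome and that `max_gap` has been derived from the insert size distribution (for example 2 x 4 MAD). The greedy strategy and the name `cluster_pairs` are illustrative, not the method of a particular tool.

```python
def cluster_pairs(pairs, max_gap):
    """Greedily group discordant pairs whose two ends both lie close together."""
    clusters = []
    for left, right in sorted(pairs):
        last = clusters[-1] if clusters else None
        # Join the current cluster only if both mates are near its latest pair.
        if (last and left - last[-1][0] <= max_gap
                and abs(right - last[-1][1]) <= max_gap):
            last.append((left, right))
        else:
            clusters.append([(left, right)])
    return clusters
```

The number of pairs in each returned cluster is what a later filtering step would compare against a minimum support threshold.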
  • 90. Clustering: issues • Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications) • What cutoff for what is considered abnormal distance? (4 MAD? 0.01%? 2 stdev?) • Low library quality or mix of libraries => multiple peaks in size distribution
  • 91. Filtering • On number of read pairs (RPs) per cluster • normally: n = 2 • for high coverage (e.g. 1000 Genomes pilot 2: 80X): n = 5 • On drop in read depth and split reads • On (mappingQ x nrRP) • if published data available: look at specificity and sensitivity for different cutoffs mQ x nrRP • if not: very difficult
  • 92. Filtering: issues • Large insert size: low resolution for detecting breakpoints • Small insert size: low resolution for detecting complex regions
  • 94. Mapping • short subsequences => many possible mappings • solution: “anchored split mapping” (e.g. Pindel) (Medvedev et al, 2009)
  • 95. Local reassembly • Aim: to determine breakpoints • Which reads? • for deletions: local reads • for insertions: hanging reads for read pairs with only one read mapped • (rather not: unmapped reads) • For large region: split up
  • 97. Figure (Nielsen et al, 2009): sequence reads -> contigs (using sequence overlap); contigs -> scaffolds (using read-pair information); 1 scaffold = several contigs
  • 98. General conclusions NGS & structural variation (1) • read depth: + conceptually simple; - only unbalanced (CNVs), low resolution • read pairs: + wide range of types of variation; - complicated • split reads: + basepair resolution; - very small reads
  • 99. General conclusions NGS & structural variation (2) • Available algorithms: more to demonstrate technique than comprehensive solution • Difficult => different software = different results => “consensus set” • based on read pairs and split reads: many sets agree • based on read depth: totally different • sometimes drop in read depth, but no aberrant read pairs spanning the region => why??? • Mapper = critical; maq/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results
  • 100. Software for structural variation discovery (Medvedev et al, 2009)
  • 104. References and software • Medvedev P et al. Nat Methods 6(11):S13-S20 (2009) • Lee S et al. Bioinformatics 24:i59-i67 (2008) • Hormozdiari F et al. Genome Res 19:1270-1278 (2009) • Campbell P et al. Nat Genet 40:722-729 (2008) • Ye K et al. Bioinformatics 25(21):2865-2871 (2009) • Chen K et al. Genome Res 19:1527-1741 (2009) • Yoon S et al. Genome Res 19:1586-1592 (2009) • Du J et al. PLoS Comp Biol 5(7):e1000432 (2009) • Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009) • Hastings P et al. Nat Rev Genet 10:551-564 (2009)
  • 106. Finding SNPs using Galaxy Based on the SAM file you created in Galaxy in the last lecture, create a list of SNPs. You’ll first have to convert the SAM file to BAM, then create a pileup and finally filter the pileup (using “Filter pileup on coverage and SNPs”). Let this filter only return variants where the coverage is larger than 3 and the base quality is larger than 20. How many SNPs do you find? Calculate a histogram of the coverage over all SNPs (= column 4 in the filtered file you just created).
  • 107. Finding SNPs using samtools Using the SAM file you created in the last lecture on the Linux command line: generate a BAM file and sort it. Next, generate a pileup for that BAM file using ~jaerts/i0d51a/chr9.fa as the reference sequence. When doing this: only print the variant sites and also compute the reference sequence (run “samtools pileup” without arguments to get more info). How many SNPs are identified? Is the SNP at position 139,391,636 heterozygous or homozygous-non-reference? And the one at 139,399,365? Do you trust the SNP at 139,401,304?
  • 108. Annotating and filtering SNPs Download ~jaerts/i0d51a/sift.input to your own machine and then upload it to the SIFT website at http://bit.ly/siftsnp. Positions in this file are on Homo sapiens build NCBI36. Make sure to let SIFT send the results by email. How many SNPs are in/near genes? How many are in exons? What percentage of the SNPs is predicted damaging?
  • 109. Structural variation We’ll be looking at copy number variation using the cnv-seq package. This software is available from http://tiger.dbs.nus.edu.sg/cnv-seq/ We’ll be running the example from the cnv-seq tutorial at http://tiger.dbs.nus.edu.sg/cnv-seq/doc/manual.pdf. (Read that!) • Log into the server mentioned on Toledo. • Calculate CNVs in the file ~jaerts/i0d51a/test_1.hits compared to ~jaerts/i0d51a/ref_1.hits: /mnt/apps/cnv-seq/current/cnv-seq.pl --test ~jaerts/i0d51a/test_1.hits --ref ~jaerts/i0d51a/ref_1.hits --genome chrom1 --log2 0.6 -p 0.001 --bigger-window 1.5 --annotate --minimum-windows 4 • Finally investigate in R. Start R by typing “R”. Then: library(cnv); data <- read.delim('test_1.hits-vs-ref_1.hits.log2-0.6.pvalue-0.001.miw-4.cnv'); cnv.print(data); cnv.summary(data); plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6); ggsave('sample_1.pdf') • Describe the main features in the plot.