Next-generation sequencing - variation discovery

Published in: Education

  1. 1. [I0D51A] Bioinformatics: High-Throughput Analysis. Next-generation sequencing, Part 3: Variation discovery. Prof Jan Aerts, Faculty of Engineering - ESAT/SCD (jan.aerts@esat.kuleuven.be). TA: Alejandro Sifrim (alejandro.sifrim@esat.kuleuven.be)
  2. 2. Context
  3. 3. Types of genomic variation: SNPs vs structural variation
  4. 4. A - Single nucleotide polymorphisms (SNPs)
  5. 5. What are SNPs and why are they important?
      • SNP = single nucleotide polymorphism
      • It’s the differences that matter:
        • Human vs chimp: 98% identical (<2 differences every 100bp)
        • Between any 2 individuals: 1 difference every 1000bp
        • Disease: A or G == life or death
      • Mutations can result in:
        • change in level of transcription or translation (loss/gain)
        • change in protein structure
  7. 7. SNP discovery - overview: generate sequence reads ➡ map reads to reference sequence ➡ convert from read-based to position-based (“pileup”) ➡ identify differences
  12. 12. Not a trivial problem... (image: Monet, “Meule, Effet de Neige, le Matin”)
  13. 13. Many SNP callers:
      • samtools
      • GATK
      • SOAPsnp
      • ...
      Read-based -> position-based. Here: (1) samtools -> pileup; (2) GATK -> VCF
  14. 14. pileup 14
  16. 16. pileup (slide annotation: alignment mapping quality)
      1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
      1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
      1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
      1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
      1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
      1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
      1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
      1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
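To make the pileup columns concrete, here is a minimal Python sketch that counts the alleles seen at one position. It assumes the six-column layout shown on the slide (chromosome, position, reference base, depth, read bases, base qualities) and the usual samtools conventions for the read-bases column: `.`/`,` for a match on the forward/reverse strand, `^` followed by one mapping-quality character at a read start, `$` at a read end, and `+n`/`-n` for indels. The function name `count_bases` is made up for this illustration.

```python
import re
from collections import Counter

def count_bases(ref, bases):
    """Count the alleles observed in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == '^':        # read start: the next character encodes mapping quality
            i += 2
            continue
        if c == '$':        # read end marker, carries no base
            i += 1
            continue
        if c in '+-':       # indel: sign, length, then that many inserted/deleted bases
            m = re.match(r'\d+', bases[i + 1:])
            i += 1 + len(m.group()) + int(m.group())
            continue
        if c in '.,':       # match to the reference (forward / reverse strand)
            counts[ref.upper()] += 1
        elif c != '*':      # mismatch; '*' is a deletion placeholder and is skipped
            counts[c.upper()] += 1
        i += 1
    return counts

# Position 277 from the slide (whitespace-separated here; real files use tabs).
line = "1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<"
chrom, pos, ref, depth, bases, quals = line.split()
counts = count_bases(ref, bases)
print(chrom, pos, dict(counts))
```

At this position the reads split between the reference T and the alternative alleles C and G, which is the kind of site a SNP caller then has to judge.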
  17. 17. Intermezzo: quality scores
      “Phred score”: used for sequence quality as well as mapping quality. Q = -10 × log10(P), with P the probability of an error.
      Chance of 1/1000 that read is mapped at wrong position = 10^-3 => Phred score = 30
      Chance of 1/100 that read is mapped at wrong position = 10^-2 => Phred score = 20
      Sanger encoding (ASCII code = quality score + 33): quality score 30 = “?”
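The conversion between error probability, Phred score and quality character can be sketched in a few lines of Python. This assumes the Sanger convention (ASCII code = score + 33); the function names are illustrative only.

```python
import math

def phred(p_error):
    """Phred score for a given error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Inverse of phred(): error probability for a given score."""
    return 10 ** (-q / 10)

def sanger_char(q):
    """Character that encodes score q under Sanger (Phred+33) encoding."""
    return chr(q + 33)

def sanger_quality(ch):
    """Score encoded by a single Sanger (Phred+33) quality character."""
    return ord(ch) - 33

print(round(phred(0.001)))       # 1 in 1000 chance of error
print(error_probability(20))     # score 20
print(sanger_quality('<'))       # '<' is the most common character in the pileup above
```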
  19. 19. Heterozygous SNPs and the binomial distribution
      SNPs are bi-allelic => allele combinations for a heterozygous SNP follow a binomial distribution:
      • outcome = binary (red/white, head/tail, yes/no, A/G)
      • probability p of the outcome of a single draw is the same for all draws
      E.g. 8 A’s + 12 G’s = SNP?
      • hypothesis: heterozygous => nr of draws = 20; nr of “successes” = 8; probability p of outcome in single draw = 0.5
      • table with cumulative binomial probabilities: http://bit.ly/cumul_binom_prob
      • 8 A’s given coverage of 20 => cumulative probability = 0.252 > 0.05 => the heterozygous hypothesis is not rejected => heterozygote
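Instead of looking the value up in a table, the cumulative binomial probability for the 8-out-of-20 example can be computed directly. A minimal sketch using only the Python standard library (the function name `binom_cdf` is made up here):

```python
from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p): chance of at most k 'successes' in n draws."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# 8 A's out of 20 reads, under the hypothesis that the site is heterozygous (p = 0.5)
prob = binom_cdf(8, 20)
print(round(prob, 3))   # 0.252
print(prob > 0.05)      # True: the heterozygous hypothesis is not rejected
```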
  21. 21. samtools
      samtools pileup -vcs -r 0.001 -l CCDS.txt -f human_b36_plus.fasta input.bam > output.pileup
  22. 22. VCF file
      ##fileformat=VCFv3.3
      ##FILTER=DP,"DP < 3 || DP > 1200"
      ##FILTER=QUAL,"QUAL < 25.0"
      ##FILTER=SnpCluster,"SNPs found in clusters"
      ##FORMAT=DP,1,Integer,"Read Depth"
      ##FORMAT=GQ,1,Integer,"Genotype Quality"
      ##FORMAT=GT,1,String,"Genotype"
      ##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
      ##INFO=DB,0,Flag,"dbSNP Membership"
      ##INFO=DP,1,Integer,"Total Depth"
      ##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
      ##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
      ##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
      ##INFO=MQ,1,Float,"RMS Mapping Quality"
      ##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
      ##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
      ##annotatorReference=human_b36_plus.fasta
      ##reference=human_b36_plus.fasta
      ##source=VariantAnnotator
      ##source=VariantFiltration
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
      1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
      1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
      . . .
  23. 23. VCF file (same file as on the previous slide, annotated): the lines starting with “##” are the file header, the line starting with “#CHROM” is the column header, and the remaining lines are the actual data.
  24. 24. VCF file
      INFO column:
        DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
        DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE
      FORMAT column and sample column (a_a:bwa057_b:picard.bam):
        GT:DP:GQ 1/1:3:36.00
        GT:DP:GQ 1/1:6:45.00
      GT = genotype, DP = depth, GQ = genotype quality
      1/1 = homozygous non-reference
      0/1 = heterozygous
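As an illustration of how the INFO, FORMAT and sample columns fit together, here is a small Python sketch that parses one data line from the slides. It splits on whitespace because that is how the line appears in this transcript (real VCF files are tab-separated), and the helper names `parse_vcf_line` and `zygosity` are hypothetical, not part of any VCF library.

```python
def parse_vcf_line(line):
    """Split one single-sample VCF data line into a dictionary of its columns."""
    chrom, pos, snp_id, ref, alt, qual, flt, info, fmt, sample = line.split()[:10]
    info_dict = {}
    for item in info.split(';'):
        key, _, value = item.partition('=')
        info_dict[key] = value if value else True   # flags such as DB have no value
    # FORMAT names the fields of the sample column, in the same order
    genotype = dict(zip(fmt.split(':'), sample.split(':')))
    return {'chrom': chrom, 'pos': int(pos), 'id': snp_id, 'ref': ref, 'alt': alt,
            'qual': float(qual), 'filter': flt, 'info': info_dict, 'genotype': genotype}

def zygosity(gt):
    """Interpret a diploid genotype string such as '0/1' or '1/1'."""
    a, b = gt.replace('|', '/').split('/')
    if a != b:
        return 'heterozygous'
    return 'homozygous reference' if a == '0' else 'homozygous non-reference'

line = ("1 856182 rs9988021 G A 36.00 0;TARGET "
        "DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00")
rec = parse_vcf_line(line)
print(rec['id'], rec['info']['DP'], zygosity(rec['genotype']['GT']))
```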
  25. 25. GATK
      java -Xmx6g -jar /path_to/GenomeAnalysisTK.jar \
        -l INFO -R human_b36_plus.fasta -I input.bam \
        -T UnifiedGenotyper --heterozygosity 0.001 -pl Solexa \
        -varout output.vcf -vf VCF -mbq 20 -mmq 10 -stand_call_conf 30.0 \
        --DBSNP dbsnp_129_b36_plus.rod
  26. 26. SNP annotation (image by piculak, Flickr)
  27. 27. We have: chromosome + position + alleles
      We need:
      • in gene?
      • damaging?
      This will be the basis for filtering.
      Tools: SIFT (http://sift.bii.a-star.edu.sg), annovar, PolyPhen, ...
  28. 28. SIFT
      input:
        3,81780820,1,T/C
        2,43881517,1,A/T
        2,43857514,1,T/C
      output:
        #SNP              codon    substitution  region    type           prediction  gene      OMIM
        3,81780820,1,T/C  AGA-gGA  R190G         EXON CDS  Nonsynonymous  DAMAGING    GBE1      POLYGLUCOSAN BODY DISEASE
        2,43881517,1,A/T  ATA-tTA  I230L         EXON CDS  Nonsynonymous  TOLERATED   DYNC2LI1
        2,43857514,1,T/C  TTT-TcT  F33S          EXON CDS  Nonsynonymous  TOLERATED   DYNC2LI1
  30. 30. SNP filtering
      2 aspects:
      • filtering to improve quality of SNP calls
      • filtering to find likely candidates
  31. 31. Filtering to improve quality
      Reduce false positives without increasing false negatives:
      • depth of coverage
      • mapping quality
      • SNP clusters
      • allelic balance (diploid genome)
      • number of reads with mq0
      • consequence
  32. 32. GATK
      java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantFiltration \
        -R human_b36_plus.fasta -o output.vcf -B variant,VCF,input.vcf \
        --clusterWindowSize 10 \
        --filterExpression 'DP < 3 || DP > 1200' --filterName 'DP' \
        --filterExpression 'QUAL < 20' --filterName 'QUAL' \
        --filterExpression 'AB > 0.75 && DP > 40' --filterName 'AB'
  33. 33. VCF file (after filtering; file header identical to the one on slide 22)
      #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam
      1 856182 rs9988021 G A 36.00 DP DB;DP=2;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
      1 866362 rs4372192 A G 45.00 PASSED DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
      . . .
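The effect of the three filter expressions on the FILTER column can be mimicked in plain Python. This is only a sketch of the logic, not how GATK implements it: the records are simple dictionaries, the cluster filter is left out, and the `PASSED` label follows the slide.

```python
# One test per named filter, mirroring the --filterExpression / --filterName pairs.
FILTERS = {
    'DP': lambda r: r['DP'] < 3 or r['DP'] > 1200,
    'QUAL': lambda r: r['QUAL'] < 20,
    # AB is only annotated for heterozygous calls, so it may be missing
    'AB': lambda r: r.get('AB') is not None and r['AB'] > 0.75 and r['DP'] > 40,
}

def apply_filters(record):
    """Return the names of all failed filters, or 'PASSED' if none failed."""
    failed = [name for name, fails in FILTERS.items() if fails(record)]
    return ';'.join(failed) if failed else 'PASSED'

print(apply_filters({'DP': 2, 'QUAL': 36.0}))              # DP
print(apply_filters({'DP': 6, 'QUAL': 45.0}))              # PASSED
print(apply_filters({'DP': 50, 'QUAL': 10.0, 'AB': 0.9}))  # QUAL;AB
```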
  34. 34. Transition/transversion ratio (Ti/Tv)
      • random: Ti/Tv = 0.5
      • whole genome: Ti/Tv = 2.0-2.1
      • exome: Ti/Tv = 3-3.5
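The value of 0.5 for random substitutions follows from counting: of the 12 possible single-base changes, 4 are transitions (purine <-> purine or pyrimidine <-> pyrimidine) and 8 are transversions. A short Python sketch, with illustrative function names, that computes Ti/Tv for a list of (ref, alt) pairs:

```python
from itertools import permutations

PURINES = {'A', 'G'}
PYRIMIDINES = {'C', 'T'}

def is_transition(ref, alt):
    """True for A<->G or C<->T; every other substitution is a transversion."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def ti_tv(snps):
    """Transition/transversion ratio for a list of (ref, alt) pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float('inf')

# All 12 possible substitutions, each equally likely: the "random" case
all_substitutions = list(permutations('ACGT', 2))
print(ti_tv(all_substitutions))   # 0.5
```

A call set whose Ti/Tv drifts from the expected genome or exome value towards 0.5 therefore likely contains many false positives.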
  35. 35. Number of novel SNPs
      exome: total = 20k - 25k; novel = 1k - 3k
  36. 36. Factors that influence SNP accuracy
      • sequencing technology
      • mapping algorithms and parameters
      • post-mapping manipulation: duplicate removal, base quality recalibration, local realignment around indels, ...
      • SNP calling algorithms and parameters
  37. 37. Specificity vs sensitivity (figure: true positives vs false positives)
  38. 38. Filtering to find likely candidates
      Which are the most interesting?
      • only high quality: DP, QUAL, AB, but keep an eye on Ti/Tv
      • novel
      • loss-of-function (stop gained, splice site, ...) or predicted to be damaging (non-synonymous)
      • found in multiple individuals
      • conserved
      • homozygous non-reference or compound heterozygous
  39. 39. Disease model
      • dominant: a single heterozygous SNP is damaging
      • recessive: either homozygous non-reference or compound heterozygous necessary to lead to disease phenotype (e.g. phenylketonuria: cannot convert phenylalanine to tyrosine. Can lead to: mental retardation, microcephaly, ...)
  40. 40. B - Structural variation
  41. 41. Why bother?
      Iafrate et al, Nat Genet 2004 & Sebat et al, Science 2004
      Redon et al, Nature 2006: 12% of genome is covered by copy number variable regions (270 individuals) => more nucleotide content per genome than SNPs
      • colour vision in primates
      • CCL3L1 copy number -> susceptibility to HIV
      • AMY1 copy number -> diet
      => “the dynamic genome”
  42. 42. Case 1: Evolution - chromosome fusion
  43. 43. human chromosome 2; chimp chromosome 12; chimp chromosome 13 (figure by Beth Kramer)
  44. 44. Case 2: Cancer - rearranged genome. Normal karyotype vs colorectal cancer karyotype (Molecular Biology of the Cell, 4th Edition)
  45. 45. Case 3: Embryogenesis - “abnormal” cells: segmental chromosomal imbalances, mosaicism for whole chromosomes, uniparental isodisomy (Robberecht et al, 2010)
  46. 46. Case 4: Down Syndrome = trisomy 21
  47. 47. Types of structural variation (Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009)
  48. 48. Types of structural variation (Aerts & Tyler-Smith, In: Encyclopedia of Life Sciences, 2009). CNV = Copy Number Variation
  49. 49. Copy number variation (CNV)
      Not equally distributed over genome: more pericentromeric and subtelomeric (especially in primates)
      Pericentromeric & subtelomeric regions: bias towards interchromosomal rearrangements; interstitial regions: bias towards intrachromosomal
      Generation of duplications:
      • pericentromeric: 2-stage model (Sharp & Eichler, 2006)
        1. series of seeding events: one or more progenitor loci transpose together to a pericentromeric receptor => generates mosaic block of duplicated segments derived from different loci
        2. inter- & intrachromosomal duplication => large blocks are duplicated near other centromeres
      • subtelomeric: due to normal recombination: cross-overs lead to translocation of distal sequences between chromosomes
  50. 50. Copy number variation and segmental duplications
      Close relationship between CNVs and segmental duplications (aka low-copy repeats aka LCRs; genomic regions with >1 copy that are at least 1kb long and have at least 90% sequence similarity):
      • Copy number variation that is fixed in population = segmental duplication (in other words: segmental duplications started out themselves as copy number variations)
      • Segmental duplications can stimulate formation of new CNVs due to NAHR (see later)
      ➡ In human + chimp: 70-80% of inversions and 40% of insertions/deletions overlap with segmental duplications
      ➡ 80% of human segmental duplications arose after the divergence of Great Apes from the rest of the primates
  51. 51. Effects of structural variation (Feuk et al, 2006)
  52. 52. Mechanisms of formation for structural variation (Gu et al, 2008)
  53. 53. Mechanisms: NAHR (Hastings et al, 2009)
      NAHR = non-allelic homologous recombination
      often between segmental duplications
      • can recur
      • clustered breakpoints
      • larger
  54. 54. Mechanisms: NHEJ (Gu et al, 2008)
      NHEJ = non-homologous end-joining
      pathway to repair double-strand breaks, but may lead to translocations and telomere fusion
      not associated with segmental duplications
      • more scattered
      • unique origins
      • smaller
  55. 55. Mechanisms: FoSTeS (Hastings et al, 2009)
      FoSTeS = DNA replication fork-stalling and template switching
      can occur multiple times in series => can generate very complex rearrangements
  56. 56. Discovery of structural variation (Feuk et al, 2006)
  57. 57. Approaches for discovery
      • karyotyping, fluorescent in situ hybridization (FISH)
      • array comparative genomic hybridization (aCGH)
      • next-generation sequencing: combination of:
        • read pair information
        • read depth information
        • split read information
        • for fine-mapping breakpoints: local assembly
      => identify signatures
  58. 58. Structural variation discovery using FISH (Feuk et al, 2006)
      FISH = fluorescent in situ hybridization
      Figure panels: duplication, inversion, duplication
  59. 59. Structural variation discovery using aCGH (Xie & Tammi, 2009)
      aCGH = array comparative genomic hybridization
  60. 60. http://www.breenlab.org/array.html
  61. 61. van de Wiel et al, 2010
  62. 62. Structural variation discovery using next-generation sequencing
      General approaches:
      1. Read depth
      2. Read pairs
      3. Split reads
  63. 63. Structural variation discovery: read depth (Xie & Tammi, 2009)
  64. 64. Workflow
      1. Mapping
      2. Read filtering
      3. GC correction
      4. Spike identification
      5. Validation
  65. 65. General principle
      • Similar to aCGH: using reference RD file (e.g. from 1000Genomes Project)
      • In theory: higher resolution, but noisier than aCGH
      • Algorithms not mature yet
      • More complex steps
      ➡ Data binned
  67. 67. van de Wiel et al, 2010
  68. 68. Xie & Tammi, 2009
  69. 69. Combining CNV data for >1 individuals/samples
      CNV = copy number variation
  70. 70. CNVR = copy number variation region
      CNVR = any region covered by at least 1 CNV
  71. 71. CNVE = copy number variation event
      CNVE = subgroups of CNVR with >= 50% reciprocal overlap
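The two definitions can be made concrete with a small Python sketch. It assumes CNVs on one chromosome given as half-open (start, end) intervals, takes "reciprocal overlap" to mean that the shared part covers at least the given fraction of both CNVs, and uses made-up function names.

```python
def reciprocal_overlap(a, b):
    """Fraction of overlap relative to the longer-suffering of the two CNVs,
    i.e. the smaller of (overlap / length of a) and (overlap / length of b)."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def cnv_regions(cnvs):
    """CNVRs: merge all CNVs that touch or overlap into regions covered by >= 1 CNV."""
    regions = []
    for start, end in sorted(cnvs):
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions

cnvs = [(100, 200), (150, 250), (240, 600), (900, 1000)]
print(cnv_regions(cnvs))                       # one large CNVR plus a separate one
print(reciprocal_overlap(cnvs[0], cnvs[1]))    # 0.5: same CNVE at a 50% threshold
print(reciprocal_overlap(cnvs[1], cnvs[2]))    # small: same CNVR, different CNVE
```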
  72. 72. Data normalization
      • Mainly: GC
      • Other: repeat-rich regions, mapping Q, ...
      • Fit linear model of GC-content and RD => noise decreases
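A minimal sketch of the GC correction step, assuming read depth has already been counted per bin and that a straight line is an adequate model: fit read depth against GC content by ordinary least squares, subtract the fitted value, and add the overall mean back so the corrected values stay on the original scale. The function names and the toy numbers are illustrative only.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gc_correct(gc, depth):
    """Remove the linear GC trend from per-bin read depth."""
    slope, intercept = fit_line(gc, depth)
    mean_depth = sum(depth) / len(depth)
    return [d - (slope * g + intercept) + mean_depth for g, d in zip(gc, depth)]

# Toy bins where depth depends only on GC content: all variation should vanish
gc = [0.3, 0.4, 0.5, 0.6]
depth = [20, 30, 40, 50]
corrected = gc_correct(gc, depth)
print([round(c, 6) for c in corrected])
```

Real data rarely follow a straight line over the whole GC range, so this linear form is a simplification of what dedicated tools do.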
  73. 73. Segmentation
      • Identify spikes
      • Many segmentation algorithms, e.g. GADA
      • Issues: setting parameters: when to cut off peaks?
      • Combine outputs from different runs with different parameters
      • Compare to known CNVs
  74. 74. Xie & Tammi, 2009
  75. 75. Xie & Tammi, 2009: peak
  76. 76. Xie & Tammi, 2009: ...but is this?
  77. 77. Abyzov et al
  78. 78. Drawbacks
      • Can only find unbalanced structural variation (i.e. CNVs)
      • How to assess specificity and sensitivity? => compare with known CNVs
        • Database of Genomic Variants DGV (http://projects.tcag.ca/variation/)
        • Decipher (http://decipher.sanger.ac.uk/)
      • Breakpoints: unknown
      • Different parameters for rare vs common CNVs => which?
  79. 79. Structural variation discovery: read pairs (Korbel et al, 2007)
  80. 80. Discordant readpairs
      • Orientation
      • Distance
        • Plot insert size distribution for chromosome
        • Very long tail!! => difficult to set cutoff: 4 MAD or 0.01%?
  81. 81. Read pair signatures (Medvedev et al, 2009)
  82. 82. Real data
  83. 83. Read pair workflow
      1. Map reads
      2. Identify discordant pairs
      3. Cluster on location
      4. Filter on number of readpairs per cluster
      5. Filter on read depth
      6. Filter on mapping quality for read pairs
      7. Identify signatures
      8. (Optionally) create alternative reference
      9. Validate
  84. 84. (slides 84-88: figures by Klaudia Walter)
  89. 89. Clustering
      • “standard clustering strategy”
      • only consider mate pairs that do not have concordant mappings
      • ignore read pairs that have more than one good mapping
      • clustering: use insert size distribution (e.g. 2x4 MAD)
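A deliberately simplified Python sketch of clustering on location: discordant pairs are sorted by the position of their left read, and a new cluster starts whenever the gap to the previous pair exceeds a maximum distance (which would be derived from the insert size distribution). Real tools also compare the right-hand reads and orientations; the `max_gap` and `min_pairs` parameters and the function name are hypothetical.

```python
def cluster_pairs(pairs, max_gap, min_pairs=2):
    """Group discordant read pairs, given as (left_pos, right_pos), by left position.

    A pair joins the current cluster if its left read starts within max_gap of
    the previous pair's left read; clusters with fewer than min_pairs supporting
    pairs are dropped.
    """
    clusters = []
    for pair in sorted(pairs):
        if clusters and pair[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(pair)
        else:
            clusters.append([pair])
    return [cluster for cluster in clusters if len(cluster) >= min_pairs]

pairs = [(1050, 6100), (1000, 6000), (9000, 15000), (1100, 6080)]
for cluster in cluster_pairs(pairs, max_gap=500):
    print(len(cluster), 'pairs, left reads from', cluster[0][0], 'to', cluster[-1][0])
```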
  90. 90. Clustering: issues
      • Ignores pairs that have >1 good mapping => no detection within repetitive regions (segmental duplications)
      • What cutoff for what is considered abnormal distance? (4 MAD? 0.01%? 2 stdev?)
      • Low library quality or mix of libraries => multiple peaks in size distribution
  91. 91. Filtering
      • On number of RPs per cluster
        • normally: n = 2
        • for high coverage (e.g. 1000Genomes pilot 2: 80X): n = 5
      • On drop in read depth and split reads
      • On (mappingQ x nrRP)
        • if published data available: look at specificity and sensitivity for different cutoffs mQ x nrRP
        • if not: very difficult
  92. 92. Filtering: issues
      • Large insert size: low resolution for detecting breakpoints
      • Small insert size: low resolution for detecting complex regions
  93. 93. Structural variation discovery: split reads
  94. 94. Mapping (Medvedev et al, 2009)
      • short subsequences => many possible mappings
      • solution: “anchored split mapping” (e.g. Pindel)
  95. 95. Local reassembly
      • Aim: to determine breakpoints
      • Which reads?
        • for deletions: local reads
        • for insertions: hanging reads for read pairs with only one read mapped
        • (rather not: unmapped reads)
      • For large region: split up
  97. 97. Nielsen et al, 2009
      sequence reads -> contigs (using sequence overlap)
      contigs -> scaffolds (using read-pair information)
      Figure labels: 1 scaffold, contigs
  98. 98. General conclusions NGS & structural variation (1)
      approach      +                                  -
      read depth    conceptually simple                only unbalanced (CNVs); low resolution
      read pairs    wide range of types of variation   complicated
      split reads   basepair resolution                very small reads
  99. 99. General conclusions NGS & structural variation (2)
      • Available algorithms: more to demonstrate technique than comprehensive solution
      • Difficult => different software = different results => “consensus set”
        • based on read pairs and split reads: many sets agree
        • based on read depth: totally different
        • sometimes drop in read depth, but no aberrant read pairs spanning the region => why???
      • Mapper = critical; maq/bwa: only 1 mapping (=> many false negatives); mosaik, mrFAST: return more results
  100. 100. Software for structural variation discovery (Medvedev et al, 2009)
  101. 101. Chris Yoon
  102. 102. Chris Yoon
  103. 103. Websites
      http://www.broadinstitute.org/gatk
      http://samtools.sourceforge.net
      http://picard.sourceforge.net
      http://www.annotate-it.org
      http://bit.ly/siftsnp
  104. 104. References and software
      • Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
      • Lee S et al. Bioinformatics 24:i59-i67 (2008)
      • Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
      • Campbell P et al. Nat Genet 40:722-729 (2008)
      • Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
      • Chen K et al. Genome Res 19:1527-1741 (2009)
      • Yoon S et al. Genome Res 19:1586-1592 (2009)
      • Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
      • Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences (2009)
      • Hastings P et al. Nat Rev Genet 10:551-564 (2009)
  105. 105. Exercises
  106. 106. Finding SNPs using Galaxy
      Based on the SAM file you created in Galaxy in the last lecture, create a list of SNPs. You’ll first have to convert the SAM file to BAM, then create a pileup and finally filter the pileup (using “Filter pileup on coverage and SNPs”). Let this filter only return variants where the coverage is larger than 3 and the base quality is larger than 20.
      How many SNPs do you find?
      Calculate a histogram of the coverage over all SNPs (= column 4 in the filtered file you just created).
  107. 107. Finding SNPs using samtools
      Using the SAM file you created in the last lecture on the linux command line: generate a BAM file and sort it. Next, generate a pileup for that BAM file using ~jaerts/i0d51a/chr9.fa as the reference sequence. When doing this: only print the variant sites and also compute the reference sequence (run “samtools pileup” without arguments to get more info).
      How many SNPs are identified?
      Is the SNP at position 139,391,636 heterozygous or homozygous non-reference? And the one at 139,399,365?
      Do you trust the SNP at 139,401,304?
  108. 108. Annotating and filtering SNPs
      Download ~jaerts/i0d51a/sift.input to your own machine and then upload it to the SIFT website at http://bit.ly/siftsnp. Positions in this file are on Homo sapiens build NCBI36. Make sure to let SIFT send the results by email.
      How many SNPs are in/near genes? How many are in exons?
      What percentage of the SNPs is predicted damaging?
  109. 109. Structural variation
      We’ll be looking at copy number variation using the cnv-seq package. This software is available from http://tiger.dbs.nus.edu.sg/cnv-seq/
      We’ll be running the example from the cnv-seq tutorial at http://tiger.dbs.nus.edu.sg/cnv-seq/doc/manual.pdf. (Read that!)
      • Log into the server mentioned on Toledo.
      • Calculate CNVs in the file ~jaerts/i0d51a/test_1.hits compared to ~jaerts/i0d51a/ref_1.hits:
          /mnt/apps/cnv-seq/current/cnv-seq.pl --test ~jaerts/i0d51a/test_1.hits --ref ~jaerts/i0d51a/ref_1.hits --genome chrom1 --log2 0.6 -p 0.001 --bigger-window 1.5 --annotate --minimum-windows 4
      • Finally investigate in R. Start R by typing “R”. Then:
          library(cnv)
          data <- read.delim('test_1.hits-vs-ref_1.hits.log2-0.6.pvalue-0.001.miw-4.cnv')
          cnv.print(data)
          cnv.summary(data)
          plot.cnv(data, CNV=4, upstream=4e+6, downstream=4e+6)
          ggsave('sample_1.pdf')
      • Describe the main features in the plot.
