2014 Wellcome Trust Advanced Course: NGS Course - Lecture 2


  1. WTAC NGS Course, Hinxton, 12th April 2014
     Lecture 2: Identification of SNPs, Indels, and structural variants
     Thomas Keane, Sequence Variation Infrastructure Group, WTSI
     Today's slides: ftp://ftp-mouse.sanger.ac.uk/other/tk2/WTAC-2014/Lecture2.pdf
  2. Lecture 2: Identification of SNPs, Indels, and structural variants
     ➢ VCF Format
     ➢ SNP/indel Identification
     ➢ Structural Variation
  3. VCF: Variant Call Format
     VCF is a standardised format for storing DNA polymorphism data
     ● SNPs, insertions, deletions and structural variants
     ● With rich annotations (e.g. context, predicted function, sequence data support)
     Indexed for fast data retrieval of variants from a range of positions
     Store variant information across many samples
     Record meta-data about the site
     ● dbSNP accession, filter status, validation status
     Very flexible format
     ● Arbitrary tags can be introduced to describe new types of variants
     ● No two VCF files are necessarily the same
     ● User extensible annotation fields supported
     ● Same event can be expressed in multiple ways by including different numbers
     ● Recommendation on VCF format website to ensure consistency
  4. VCF Format
     Header section and a data section
     Header
     ● Arbitrary number of meta-data information lines
     ● Starting with characters ‘##’
     ● Column definition line starts with single ‘#’
     Mandatory columns
     ● Chromosome (CHROM)
     ● Position of the start of the variant (POS)
     ● Unique identifiers of the variant (ID)
     ● Reference allele (REF)
     ● Comma separated list of alternate non-reference alleles (ALT)
     ● Phred-scaled quality score (QUAL)
     ● Site filtering information (FILTER)
     ● User extensible annotation (INFO)
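
The eight mandatory columns above can be read with a few lines of Python. This is a minimal sketch, not a full VCF parser: the names `VcfRecord`, `parse_info` and `parse_vcf_line` are hypothetical, the example line is illustrative, and per-sample FORMAT/genotype columns are ignored.

```python
from typing import NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    ids: list
    ref: str
    alt: list
    qual: Optional[float]
    filters: list
    info: dict


def parse_info(info: str) -> dict:
    """Split 'KEY=VALUE;FLAG' into a dict; flags without '=' map to True."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the 8 mandatory tab-separated columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return VcfRecord(
        chrom=chrom,
        pos=int(pos),
        ids=[] if vid == "." else vid.split(";"),
        ref=ref,
        alt=[] if alt == "." else alt.split(","),
        qual=None if qual == "." else float(qual),
        filters=[] if flt == "." else flt.split(";"),
        info=parse_info(info),
    )


# Illustrative data line (tab-separated), not taken from the course data.
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
record = parse_vcf_line(example)
```

Header lines (those starting with ‘#’) should be skipped before calling `parse_vcf_line`. For real analyses an established library or VCFtools is a safer choice than hand-rolled parsing.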
  5. Example VCF (SNPs/indels)
  6. VCF Trivia 1
     ● What version of the human reference genome was used?
     ● What does the DB INFO tag stand for?
     ● What does the ALT column contain?
     ● At position 17330, what is the total depth? What is the depth for sample NA00002?
     ● At position 17330, what is the genotype of NA00002?
     ● Which position is a tri-allelic SNP site?
     ● What sort of variant is at position 1234567? What is the genotype of NA00002?
  7. Functional Annotation
     VCF can store arbitrary
     ● INFO tags per site
     ● Genotype FORMAT tags
     Use tags to describe
     ● Genomic context of the variant (e.g. coding, intronic, non-coding, UTR, intergenic)
     ● Predicted functional consequence of the variant (e.g. synonymous/non-synonymous, protein structure change)
     ● Presence of the variant in other large resequencing studies
     Several tools for annotating a VCF
     ● SnpEff: http://snpeff.sourceforge.net/
     ● Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/script/index.html
     ● FunSeq: http://funseq.gersteinlab.org/
  8. Ensembl - VEP
     "VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions."
     Species must be included in either Ensembl or Ensembl Genomes
     Sequence Ontology (SO) terms to describe genomic context
     PubMed IDs for variants cited
     Output only the most severe consequence per variation
     Online or off-line mode
     ● Off-line recommended for large numbers of variants (download relevant cache)
     Human specific annotations
     ● SIFT - predicts whether an amino acid substitution affects protein function
     ● PolyPhen - predicts impact of an amino acid substitution on the structure of human proteins
     ● 1000 Genomes frequencies - global or per population
  9. VEP VCF
     VEP INFO tag:
     ● ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF">
     Example
     ● CSQ=T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant||||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17,T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17
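
A CSQ value like the one above is a comma-separated list of per-transcript entries, each a pipe-separated list of fields in the order given by the header's Format string. The sketch below shows one way to turn it into dictionaries. The function name `parse_csq` is hypothetical, and the field order is hard-coded from this slide; in practice it should be read from the `##INFO=<ID=CSQ...>` header line of the file being parsed, because the field list varies between VEP runs.

```python
# Field order copied from the CSQ header on the slide (an assumption for any other file).
CSQ_FORMAT = (
    "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|"
    "Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|"
    "DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|"
    "AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF"
)


def parse_csq(csq_value, csq_format=CSQ_FORMAT):
    """Return one dict per transcript consequence; empty fields stay as ''."""
    keys = csq_format.split("|")
    records = []
    for entry in csq_value.split(","):
        values = entry.strip().split("|")
        records.append(dict(zip(keys, values)))
    return records


# Two of the three entries from the slide's example.
csq = (
    "T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|34|A|"
    "gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,"
    "T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||"
    "rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17"
)
consequences = parse_csq(csq)
```

Each dictionary can then be filtered by `Consequence` or `SYMBOL`, for example to keep only coding consequences in a gene of interest.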
  10. More Information
      VCF
      ● http://bioinformatics.oxfordjournals.org/content/27/15/2156.full
      ● http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
      VCFTools
      ● http://vcftools.sourceforge.net
      GATK
      ● http://www.broadinstitute.org/gatk/
      ● http://www.broadinstitute.org/gatk/guide/article?id=1268
      VCF Annotation
      ● Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/index.html
      ● SnpEff: http://snpeff.sourceforge.net/
      ● Anntools: http://anntools.sourceforge.net/
  11. Lecture 2: Identification of SNPs, Indels, and structural variants
      ➢ VCF Format
      ➢ SNP/indel Identification
      ➢ Structural Variation
  12. SNP Identification
      SNP: single nucleotide polymorphism
      ● Examine the bases aligned to a position and look for differences
      SNP discovery vs genotyping
      ● Discovery: finding new variant sites
      ● Genotyping: determining the genotype at a set of already known sites
      Factors to consider when calling SNPs
      ● Base call qualities of each supporting base
      ● Proximity to
        ○ Small indel
        ○ Homopolymer run (>4-5bp for 454 and >10bp for Illumina)
      ● Mapping qualities of the reads supporting the SNP
        ○ Low mapping quality indicates repetitive sequence
      ● Read length
        ○ Longer reads can be aligned with high confidence to a larger portion of the genome
      ● Paired reads
      ● Sequencing depth
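
To make the base quality and mapping quality factors concrete, here is a minimal sketch of counting the bases observed at one position after discarding low-quality evidence. It is not how any particular caller works: the input layout (tuples of base, Phred base quality, mapping quality), the function names and the thresholds of 20 and 30 are all assumptions chosen for illustration.

```python
from collections import Counter


def count_alleles(observations, min_base_qual=20, min_map_qual=30):
    """Count bases at one site, ignoring low-quality bases and poorly mapped reads.

    observations: iterable of (base, base_quality, mapping_quality) tuples.
    """
    counts = Counter()
    for base, base_qual, map_qual in observations:
        if base_qual >= min_base_qual and map_qual >= min_map_qual:
            counts[base.upper()] += 1
    return counts


def alt_allele_fraction(counts, ref_base):
    """Fraction of retained bases that differ from the reference base."""
    depth = sum(counts.values())
    if depth == 0:
        return 0.0
    return (depth - counts[ref_base]) / depth


# Hypothetical pileup at a site whose reference base is 'A'.
pileup = [
    ("A", 35, 60),
    ("A", 30, 60),
    ("G", 32, 60),
    ("G", 8, 60),   # dropped: low base quality
    ("G", 33, 5),   # dropped: low mapping quality (likely repetitive sequence)
    ("a", 40, 60),
]
site_counts = count_alleles(pileup)
fraction = alt_allele_fraction(site_counts, "A")
```

Without the two filters, three of the six reads would support G; with them, only one of four does, which shows how much the quality factors can change the evidence at a site.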
  13. Mouse SNP
  14. Read Length vs. Uniqueness
  15. Inaccessible Genome
  16. Is this a real SNP?
  17. Evaluating SNPs
      Specificity vs sensitivity
      ● False positives vs. false negatives
      Desirable to have high sensitivity and specificity
      Sensitivity
      ● External sources of validation
      Specificity
      ● Test a random selection of SNPs by another technology
      ● e.g. Sequenom, Sanger sequencing…
      Receiver operating characteristic (ROC) curves to investigate effects of varying parameters
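
Sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The sketch below computes both from sets of site positions. It assumes that a validated truth set and the full set of assessed sites are available, which in practice is the hard part; the function name `evaluate_calls` and the toy positions are hypothetical.

```python
def evaluate_calls(called, truth, assessed):
    """Return (sensitivity, specificity) for a call set.

    called:   sites reported as variant by the caller
    truth:    sites known to be variant (e.g. validated by another technology)
    assessed: every site at which a call could have been made
    """
    assessed = set(assessed)
    called = set(called) & assessed
    truth = set(truth) & assessed

    tp = len(called & truth)              # real variants that were called
    fp = len(called - truth)              # calls that are not real variants
    fn = len(truth - called)              # real variants that were missed
    tn = len(assessed - called - truth)   # non-variant sites correctly left uncalled

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 10 assessed sites, 4 true variants, 4 calls (3 correct).
sens, spec = evaluate_calls(
    called={1, 2, 3, 5},
    truth={1, 2, 3, 4},
    assessed=range(1, 11),
)
```

Repeating this calculation while sweeping a caller parameter (for example a minimum quality threshold) gives the points needed for a ROC curve.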
  18. Known Systematic Biases
      Biases can be introduced during sample preparation, sequencing, computational alignment and other steps
      ● Can generate false positive SNPs/indels
      Potential biases
      ● Strand bias
      ● End distance bias
      ● Consistency across replicates/libraries
      ● Variant distance bias
      VCF Tools
      ● Soft filter variants file for these biases
      ● Variants kept in the file - just annotated with potential bias affecting the variant
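
Strand bias is commonly assessed with a test of independence on a 2x2 table of reference/alternate read counts by forward/reverse strand. The sketch below implements a two-sided Fisher's exact test with only the standard library. It is a generic illustration, not necessarily the test applied by the tools named above; the function name and the example counts are assumptions.

```python
from math import comb


def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for a 2x2 strand-count table.

    Rows are alleles (ref, alt), columns are strands (forward, reverse).
    """
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt
    denom = comb(total, col_fwd)

    def table_prob(x):
        # Hypergeometric probability of x reference reads on the forward strand.
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / denom

    p_observed = table_prob(ref_fwd)
    lowest = max(0, col_fwd - row_alt)
    highest = min(row_ref, col_fwd)
    # Sum every table that is no more probable than the observed one.
    return sum(
        table_prob(x)
        for x in range(lowest, highest + 1)
        if table_prob(x) <= p_observed * (1 + 1e-9)
    )


# Alternate allele seen on both strands: no evidence of bias.
p_balanced = fisher_exact_two_sided(10, 10, 10, 10)
# Alternate allele seen only on the forward strand: suspicious.
p_biased = fisher_exact_two_sided(10, 10, 10, 0)
```

A small p-value is a reason to soft-filter the site (annotate it while keeping it in the file) rather than proof that the variant is false.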
  19. Strand Bias
  20. End Distance Bias
  21. Variant Distance Bias
  22. Reproducibility
  23. Future of Variant Calling?
      Current approaches
      ● Rely heavily on the supplied alignment
      ● Largely site based, don't examine local haplotype
      Local de novo assembly based variant callers
      ● Call SNPs, indels, MNPs and small SVs simultaneously
      ● Can remove mapping artifacts
      ● e.g. GATK HaplotypeCaller
  24. Haplotype Based Calling - GATK
  25. Lecture 2: Identification of SNPs, Indels, and structural variants
      ➢ VCF Format
      ➢ SNP/indel Identification
      ➢ Structural Variation
  26. Genomic Structural Variation
      Large DNA rearrangements (>100bp)
      Frequent causes of disease
      ● Referred to as genomic disorders
      ● Mendelian diseases or complex traits such as behaviors
      ● E.g. increase in gene dosage due to increase in copy number
      ● Prevalent in cancer genomes
      Many types of genomic structural variation (SV)
      ● Insertions, deletions, copy number changes, inversions, translocations & complex events
      Comparative genomic hybridization (CGH) traditionally used for copy number discovery
      ● CNVs of 1-50 kb in size have been under-ascertained
      Next-gen sequencing revolutionised the field of SV discovery
      ● Parallel sequencing of ends of large numbers of DNA fragments
      ● Examine alignment distance of reads to discover presence of genomic rearrangements
      ● Resolution down to ~100bp
  27. Human Disease
      Stankiewicz and Lupski (2010) Ann. Rev. Med.
  28. Structural Variation
      Several types of structural variations (SVs)
      ● Large insertions/deletions
      ● Inversions
      ● Translocations
      Read pair information used to detect these events
      ● Paired end sequencing of either end of DNA fragment
      ● Observe deviations from the expected fragment size
      ● Presence/absence of mate pairs
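
The read pair signals listed above can be expressed as a simple decision rule. The sketch below labels a single pair from its mapping status, orientation and apparent fragment size. It is illustrative only: the function name, the labels and the 3 standard deviation cutoff are assumptions, and a real SV caller requires many supporting pairs before reporting an event.

```python
def classify_pair(mate_mapped, same_chrom, expected_orientation,
                  insert_size, mean_insert, sd_insert, n_sd=3.0):
    """Label one read pair with the SV type it would support, if any."""
    if not mate_mapped:
        # Mate may lie in sequence absent from the reference (e.g. an insertion).
        return "mate_unmapped"
    if not same_chrom:
        return "translocation_candidate"
    if not expected_orientation:
        return "inversion_candidate"
    if insert_size > mean_insert + n_sd * sd_insert:
        # Ends map further apart than the fragment is long: sequence missing in the sample.
        return "deletion_candidate"
    if insert_size < mean_insert - n_sd * sd_insert:
        # Ends map closer than expected: extra sequence present in the sample.
        return "insertion_candidate"
    return "concordant"


# Hypothetical library with mean fragment size 400bp and standard deviation 40bp.
labels = [
    classify_pair(True, True, True, 410, 400, 40),    # within expected range
    classify_pair(True, True, True, 2500, 400, 40),   # too far apart
    classify_pair(True, True, True, 150, 400, 40),    # too close together
    classify_pair(True, True, False, 400, 400, 40),   # unexpected orientation
    classify_pair(True, False, True, 0, 400, 40),     # mates on different chromosomes
    classify_pair(False, False, False, 0, 400, 40),   # mate did not align
]
```

The mean and standard deviation of the fragment size would normally be estimated from the library itself, which is why fragment size QC matters before SV calling.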
  29. Structural Variation Types
  30. Fragment Size QC
  31. What is this?
  32. What is this?
  33. What is this?
  34. Mobile Element Insertions
      Transposons are segments of DNA that can move within the genome
      ● A minimal ‘genome’ - ability to replicate and change location
      ● Relics of ancient viral infections
      Dominate landscape of mammalian genomes
      ● 38-45% of rodent and primate genomes
      ● Genome size proportional to number of TEs
      Class 1 (RNA intermediate) and 2 (DNA intermediate)
      Potent genetic mutagens
      ● Disrupt expression of genes
      ● Genome reorganisation and evolution
      ● Transduction of flanking sequence
      Species specific families
      ● Human: Alu, L1, SVA
      ● Mouse: SINE, LINE, ERV
      Many other families in other species
  35. Human Mobile Elements
  36. Mobile Element Insertions
  37. Mouse Example - LookSeq
  38. Human Alu - IGV
  39. Detecting Mobile Element Insertions
      Most algorithms for locating non-reference mobile elements operate in a similar manner
      Goal: detect all read pairs where one end flanks the insertion point and its mate lies in the inserted sequence
      Pseudo algorithm
      ● Read through BAM file and make list of all discordant read pairs
      ● Keep the pairs where one end is similar to a sequence in your library of mobile elements
      ● Remove anchor reads with low mapping quality
      ● Cluster the anchor reads and examine breakpoint
      ● Filter out any clusters close to annotated elements of the same type
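
The last three steps of the pseudo algorithm can be sketched as follows, assuming the discordant pairs have already been extracted from the BAM file and the mate of each anchor has already been compared against the element library. This is not the implementation of any published tool: the `Anchor` type, function names, thresholds and coordinates are all hypothetical.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    chrom: str
    pos: int
    mapq: int
    mate_hits_element: bool  # mate is similar to a library mobile element


def cluster_anchors(anchors, min_mapq=30, max_gap=200, min_reads=2):
    """Cluster confidently mapped anchors into (chrom, start, end, n_reads) tuples."""
    kept = sorted(
        (a.chrom, a.pos)
        for a in anchors
        if a.mate_hits_element and a.mapq >= min_mapq
    )
    clusters = []
    for chrom, pos in kept:
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and pos - last[2] <= max_gap:
            last[2] = pos      # extend the current cluster
            last[3] += 1
        else:
            clusters.append([chrom, pos, pos, 1])
    return [tuple(c) for c in clusters if c[3] >= min_reads]


def drop_near_annotated(clusters, annotated, flank=100):
    """Remove clusters within `flank` bp of a reference-annotated element."""
    def near(cluster):
        chrom, start, end, _ = cluster
        return any(
            chrom == a_chrom and start <= a_end + flank and end >= a_start - flank
            for a_chrom, a_start, a_end in annotated
        )
    return [c for c in clusters if not near(c)]


anchors = [
    Anchor("chr1", 1000, 60, True),
    Anchor("chr1", 1050, 60, True),
    Anchor("chr1", 1100, 60, True),
    Anchor("chr1", 5000, 5, True),     # removed: low mapping quality
    Anchor("chr1", 9000, 60, True),
    Anchor("chr1", 9020, 60, True),
    Anchor("chr2", 300, 60, False),    # removed: mate is not element-like
]
annotated_elements = [("chr1", 8900, 9100)]  # same-type elements in the reference
candidates = drop_near_annotated(cluster_anchors(anchors), annotated_elements)
```

The surviving clusters are only candidate insertion sites; examining the breakpoint, as the slide notes, is still needed to confirm each call.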
  40. 1000 Genomes CEU Trio
      A typical human sample has ~900-1000 non-reference mobile elements
      ● ~800 Alu elements, ~100 L1
      Why are there 44 calls private to the child?
  41. Mobile Element Software
      ● RetroSeq: https://github.com/tk2/RetroSeq
      ● VariationHunter: http://compbio.cs.sfu.ca/strvar.htm
      ● T-LEX: http://petrov.stanford.edu/cgi-bin/Tlex.html
      ● Tea: http://compbio.med.harvard.edu/Tea/
