WTAC NGS Course, Hinxton 12th
April 2014
Lecture 2: Identification of SNPs, Indels, and
structural variants
Thomas Keane
Sequence Variation Infrastructure Group
WTSI
Today's slides: ftp://ftp-mouse.sanger.ac.uk/other/tk2/WTAC-2014/Lecture2.pdf
➢ VCF Format
➢ SNP/indel Identification
➢ Structural Variation
VCF: Variant Call Format
VCF is a standardised format for storing DNA polymorphism data
● SNPs, insertions, deletions and structural variants
● With rich annotations (e.g. context, predicted function, sequence data support)
Indexed for fast data retrieval of variants from a range of positions
Store variant information across many samples
Record meta-data about the site
● dbSNP accession, filter status, validation status
Very flexible format
● Arbitrary tags can be introduced to describe new types of variants
● No two VCF files are necessarily the same
● User extensible annotation fields supported
● Same event can be expressed in multiple ways by including different numbers
● Recommendation on VCF format website to ensure consistency
VCF Format
Header section and a data section
Header
● Arbitrary number of meta-data information lines
● Starting with characters ‘##’
● Column definition line starts with single ‘#’
Mandatory columns
● Chromosome (CHROM)
● Position of the start of the variant (POS)
● Unique identifiers of the variant (ID)
● Reference allele (REF)
● Comma separated list of alternate non-reference alleles (ALT)
● Phred-scaled quality score (QUAL)
● Site filtering information (FILTER)
● User extensible annotation (INFO)
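A minimal sketch of the layout described above, not a full parser: it splits one tab-separated data line into the eight mandatory columns, then reads the optional FORMAT and per-sample columns. The record is illustrative (modelled on the example in the VCF specification), and the name `parse_vcf_line` is hypothetical.

```python
MANDATORY = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line, sample_names):
    """Split one VCF data line into mandatory columns and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(MANDATORY, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # comma-separated alternate alleles

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags
    info = {}
    for item in record["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True
    record["INFO"] = info

    # Optional genotype columns: FORMAT names the ':'-separated sample sub-fields
    samples = {}
    if len(fields) > 9:
        keys = fields[8].split(":")
        for name, column in zip(sample_names, fields[9:]):
            samples[name] = dict(zip(keys, column.split(":")))
    record["SAMPLES"] = samples
    return record

# Illustrative record; the sample names are assumptions for this sketch
line = "\t".join([
    "20", "17330", ".", "T", "A", "3", "q10", "NS=3;DP=11;AF=0.017",
    "GT:GQ:DP:HQ", "0|0:49:3:58,50", "0|1:3:5:65,3", "0/0:41:3",
])
rec = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
print(rec["INFO"]["DP"], rec["SAMPLES"]["NA00002"]["GT"], rec["SAMPLES"]["NA00002"]["DP"])
```

Values are left as strings apart from POS, since the header's ##INFO and ##FORMAT lines define each tag's type and a real parser would consult them.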
Example VCF (SNPs/indels)
VCF Trivia 1
What version of the human reference genome was used?
What does the DB INFO tag stand for?
What does the ALT column contain?
At position 17330, what is the total depth? What is the depth for sample NA00002?
At position 17330, what is the genotype of NA00002?
Which position is a tri-allelic SNP site?
What sort of variant is at position 1234567? What is the genotype of NA00002?
Functional Annotation
VCF can store arbitrary
● INFO tags per site
● Genotype FORMAT tags
Use tags to describe
● Genomic context of the variant (e.g. coding, intronic, non-coding, UTR, intergenic)
● Predicted functional consequence of the variant (e.g. synonymous/non-synonymous, protein structure change)
● Presence of the variant in other large resequencing studies
Several tools for annotating a VCF
● SnpEff: http://snpeff.sourceforge.net/
● Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/script/index.html
● FunSeq: http://funseq.gersteinlab.org/
Ensembl - VEP
"VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants)
on genes, transcripts, and protein sequence, as well as regulatory regions."
Species must be included in either Ensembl or Ensembl Genomes
Sequence ontology (SO) terms to describe genomic context
PubMed IDs for cited variants
Can output only the most severe consequence per variation
Online or off-line mode
● Off-line recommended for large numbers of variants (download relevant cache)
Human-specific annotations
● SIFT - predicts whether an amino acid substitution affects protein function
● PolyPhen - predicts the impact of an amino acid substitution on the structure of human proteins
● 1000 genomes frequencies - global or per population
VEP VCF
VEP INFO tag:
● ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF">
Example
● CSQ=T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant||||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17,
T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,
T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17
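A minimal sketch of how the CSQ value above can be unpacked: entries are comma-separated (one per allele/transcript combination) and each entry is '|'-separated in the order given by the Format string in the ##INFO header. The function name `parse_csq` is hypothetical; the field list and example value are the ones shown on this slide.

```python
CSQ_FORMAT = (
    "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|"
    "Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|"
    "DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|AFR_MAF|"
    "AMR_MAF|ASN_MAF|EUR_MAF"
)

def parse_csq(value, fmt=CSQ_FORMAT):
    """Return one dict per comma-separated CSQ entry, keyed by the header's field names."""
    keys = fmt.split("|")
    return [dict(zip(keys, entry.split("|"))) for entry in value.split(",")]

csq = (
    "T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant|"
    "|||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17,"
    "T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|"
    "34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,"
    "T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||"
    "rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17"
)

consequences = parse_csq(csq)
for c in consequences:
    print(c["SYMBOL"], c["Feature"], c["Consequence"])
```

In practice the field order should be read from the file's own ##INFO line rather than hard-coded, because the set of fields depends on the VEP options used.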
More Information
VCF
● http://bioinformatics.oxfordjournals.org/content/27/15/2156.full
● http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
VCFtools
● http://vcftools.sourceforge.net
GATK
● http://www.broadinstitute.org/gatk/
● http://www.broadinstitute.org/gatk/guide/article?id=1268
VCF Annotation
● Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/index.html
● SnpEff: http://snpeff.sourceforge.net/
● Anntools: http://anntools.sourceforge.net/
SNP Identification
SNP: single nucleotide polymorphism
● Examine the bases aligned to a position and look for differences from the reference
SNP discovery vs genotyping
● Discovery: finding new variant sites
● Genotyping: determining the genotype at a set of already known sites
Factors to consider when calling SNPs
● Base call qualities of each supporting base
● Proximity to
○ Small indel
○ Homopolymer run (>4-5bp for 454 and >10bp for Illumina)
● Mapping qualities of the reads supporting the SNP
○ Low mapping quality indicates repetitive sequence
● Read length
○ Longer reads can be aligned with high confidence to a larger portion of the genome
● Paired reads
● Sequencing depth
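A rough sketch of how some of the factors above (base quality, mapping quality, depth) might be combined when examining the bases aligned to one position. This is a naive counting approach for illustration, not how production callers work; the function name and all thresholds are assumptions, not values from the lecture.

```python
from collections import Counter

def candidate_snp(ref_base, observations, min_baseq=20, min_mapq=30,
                  min_depth=8, min_alt_fraction=0.2):
    """observations: list of (base, base_quality, mapping_quality) at one position.

    Returns a dict describing the best-supported non-reference allele, or None.
    """
    # Ignore bases with poor base-call quality or from poorly mapped reads
    usable = [base for base, baseq, mapq in observations
              if baseq >= min_baseq and mapq >= min_mapq]
    depth = len(usable)
    if depth < min_depth:
        return None  # too little reliable evidence at this site

    alt_counts = Counter(b for b in usable if b != ref_base)
    if not alt_counts:
        return None
    alt, alt_count = alt_counts.most_common(1)[0]
    if alt_count / depth < min_alt_fraction:
        return None
    return {"alt": alt, "depth": depth, "alt_count": alt_count}

# Toy pileup: 6 good reference bases, 5 good G bases,
# 2 low-quality G bases and 3 T bases from reads with mapping quality 0
obs = ([("A", 35, 60)] * 6 + [("G", 34, 60)] * 5 +
       [("G", 8, 60)] * 2 + [("T", 30, 0)] * 3)
call = candidate_snp("A", obs)
print(call)
```

The low-quality and poorly mapped observations are discarded before counting, so only the 11 reliable bases contribute to the depth and allele fraction.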
Mouse SNP
Read Length vs. Uniqueness
Inaccessible Genome
Is this a real SNP?
Evaluating SNPs
Specificity vs sensitivity
● False positives vs. false negatives
Desirable to have high sensitivity and specificity
Sensitivity
● External sources of validation
Specificity
● Test a random selection of SNPs by another technology
● e.g. Sequenom, Sanger sequencing…
Receiver operating characteristic (ROC) curves to investigate the effects of varying parameters
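A small sketch of the two measurements described above, using sets of (chromosome, position) sites. Sensitivity is estimated against an external set of known true sites. The validation experiment yields the fraction of tested calls that were confirmed, which is strictly a positive predictive value and serves here as the practical check on false positives. Function names and the toy data are assumptions.

```python
def sensitivity(called_sites, known_true_sites):
    """Fraction of externally validated true sites that the caller found."""
    known = set(known_true_sites)
    return len(known & set(called_sites)) / len(known)

def validation_rate(tested_calls, confirmed_calls):
    """Fraction of a random sample of calls confirmed by another technology."""
    tested = set(tested_calls)
    return len(tested & set(confirmed_calls)) / len(tested)

# Toy data: sites are (chromosome, position) tuples
called = {("1", 100), ("1", 250), ("2", 40), ("2", 900)}
known_true = {("1", 100), ("1", 250), ("2", 77)}   # e.g. from an external resource
tested = [("1", 100), ("2", 40), ("2", 900)]       # random subset sent for validation
confirmed = [("1", 100), ("2", 900)]               # subset that validated

print(sensitivity(called, known_true))       # 2 of 3 known sites were called
print(validation_rate(tested, confirmed))    # 2 of 3 tested calls were confirmed
```

Repeating these two calculations while varying a caller threshold gives the points needed to draw a ROC-style curve.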
Known Systematic Biases
Biases can be introduced during sample preparation, the sequencing process, computational alignment steps, etc.
● Can generate false positive SNPs/indels
Potential biases
● Strand bias
● End distance bias
● Consistency across replicates/libraries
● Variant distance bias
VCFtools
● Soft filter the variants file for these biases
● Variants are kept in the file, just annotated with the potential bias affecting the variant
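As an illustration of soft filtering for one of these biases, the sketch below tests for strand bias with a two-sided Fisher's exact test on the forward/reverse read counts supporting the reference and alternate alleles. This is one common way to test for strand bias and is not necessarily the test any particular tool uses; the filter label `StrandBias` and the p-value threshold are assumptions.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of a table with x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def soft_filter_strand_bias(ref_fwd, ref_rev, alt_fwd, alt_rev, min_p=0.01):
    """Return a FILTER-style label; the variant is annotated, never dropped."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return "StrandBias" if p < min_p else "PASS"

# Alternate allele seen only on forward-strand reads: flagged
print(soft_filter_strand_bias(20, 20, 12, 0))
# Alternate allele balanced across strands: passes
print(soft_filter_strand_bias(10, 10, 10, 10))
```

Returning a label rather than discarding the record mirrors the soft-filter idea: downstream users can still see the call and decide for themselves.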
Strand Bias
End Distance Bias
Variant Distance Bias
Reproducibility
Future of Variant Calling?
Current approaches
● Rely heavily on the supplied alignment
● Largely site based, don't examine local haplotype
Local de novo assembly-based variant callers
● Call SNPs, indels, MNPs and small SVs simultaneously
● Can remove mapping artifacts
● e.g. GATK HaplotypeCaller
Haplotype Based Calling - GATK
Genomic Structural Variation
Large DNA rearrangements (>100bp)
Frequent causes of disease
● Referred to as genomic disorders
● Mendelian diseases or complex traits such as behaviors
● E.g. increase in gene dosage due to increase in copy number
● Prevalent in cancer genomes
Many types of genomic structural variation (SV)
● Insertions, deletions, copy number changes, inversions, translocations & complex events
Comparative genomic hybridization (CGH) traditionally used for copy number discovery
● CNVs of 1-50 kb in size have been under-ascertained
Next-gen sequencing revolutionised the field of SV discovery
● Parallel sequencing of ends of large numbers of DNA fragments
● Examine alignment distance of reads to discover presence of genomic rearrangements
● Resolution down to ~100bp
Human Disease
Stankiewicz and Lupski (2010) Ann. Rev. Med.
Structural Variation
Several types of structural variations (SVs)
● Large Insertions/deletions
● Inversions
● Translocations
Read pair information used to detect these events
● Paired-end sequencing of both ends of a DNA fragment
● Observe deviations from the expected fragment size
● Presence/absence of mate pairs
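A simplified sketch of how read pair information can point to each event type. It assumes a standard paired-end library in which concordant pairs map to the same chromosome on opposite strands, and it approximates fragment size by the distance between the two mapping positions. The class, function, labels and the 3-standard-deviation cut-off are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str          # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str
    mate_mapped: bool = True

def classify_pair(pair, mean_insert, sd_insert, n_sd=3):
    """Label one read pair by the kind of structural variant it may support."""
    if not pair.mate_mapped:
        return "one_end_anchored"         # mate absent: possible novel insertion
    if pair.chrom1 != pair.chrom2:
        return "translocation_candidate"  # ends map to different chromosomes
    if pair.strand1 == pair.strand2:
        return "inversion_candidate"      # unexpected relative orientation
    span = abs(pair.pos2 - pair.pos1)
    if span > mean_insert + n_sd * sd_insert:
        return "deletion_candidate"       # ends map further apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion_candidate"      # ends map closer together than expected
    return "concordant"

pairs = [
    ReadPair("1", 1000, "+", "1", 1400, "-"),
    ReadPair("1", 1000, "+", "1", 5000, "-"),
    ReadPair("1", 1000, "+", "1", 1100, "-"),
    ReadPair("1", 1000, "+", "1", 1400, "+"),
    ReadPair("1", 1000, "+", "7", 1400, "-"),
]
labels = [classify_pair(p, mean_insert=400, sd_insert=30) for p in pairs]
print(labels)
```

A single discordant pair is weak evidence; callers typically require several pairs with consistent signals clustered at the same location before reporting an event.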
Structural Variation Types
Fragment Size QC
What is this?
What is this?
What is this?
Mobile Element Insertions
Transposons are segments of DNA that can move within the genome
● A minimal ‘genome’ - ability to replicate and change location
● Relics of ancient viral infections
Dominate landscape of mammalian genomes
● 38-45% of rodent and primate genomes
● Genome size proportional to number of TEs
Class 1 (RNA intermediate) and 2 (DNA intermediate)
Potent genetic mutagens
● Disrupt expression of genes
● Genome reorganisation and evolution
● Transduction of flanking sequence
Species specific families
● Human: Alu, L1, SVA
● Mouse: SINE, LINE, ERV
Many other families in other species
Human Mobile Elements
Mobile Element Insertions
Mouse Example - LookSeq
Human Alu - IGV
Detecting Mobile Element Insertions
Most algorithms for locating non-reference mobile elements operate in a similar manner
Goal: Detect all read pairs where one end flanks the insertion point and the mate is in the inserted sequence
Pseudo algorithm
● Read through the BAM file and make a list of all discordant read pairs
● Keep the pairs where one end is similar to an element in your library of mobile elements
● Remove anchor reads with low mapping quality
● Cluster the anchor reads and examine the breakpoint
● Filter out any clusters close to annotated elements of the same type
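The pseudo algorithm above can be sketched on toy data as follows. This is an illustration of the general steps only, not the implementation of any specific tool. It assumes the discordant pairs have already been extracted from the BAM file and that the non-anchor end has already been compared against the mobile element library, so each pair arrives as (chromosome, anchor position, anchor mapping quality, matched element family or None). The function name, thresholds and tuple layouts are hypothetical.

```python
from collections import defaultdict

def call_insertions(pairs, annotated, min_mapq=30, window=500, min_reads=3):
    """pairs: (chrom, anchor_pos, anchor_mapq, family_or_None) per discordant pair.
    annotated: (chrom, start, end, family) for elements already in the reference.
    Returns (chrom, cluster_start, cluster_end, family) candidate insertions.
    """
    # Keep pairs whose mate matched the library and whose anchor is well mapped
    anchors = defaultdict(list)
    for chrom, pos, mapq, family in pairs:
        if family is not None and mapq >= min_mapq:
            anchors[(chrom, family)].append(pos)

    calls = []
    for (chrom, family), positions in sorted(anchors.items()):
        positions.sort()
        cluster = [positions[0]]
        # The trailing None is a sentinel that flushes the final cluster
        for pos in positions[1:] + [None]:
            if pos is not None and pos - cluster[-1] <= window:
                cluster.append(pos)
                continue
            if len(cluster) >= min_reads:
                start, end = cluster[0], cluster[-1]
                # Discard clusters close to an annotated element of the same type
                near_known = any(
                    c == chrom and f == family
                    and s - window <= end and start <= e + window
                    for c, s, e, f in annotated
                )
                if not near_known:
                    calls.append((chrom, start, end, family))
            if pos is not None:
                cluster = [pos]
    return calls

pairs = [
    ("1", 1000, 60, "Alu"), ("1", 1100, 60, "Alu"), ("1", 1250, 55, "Alu"),
    ("1", 1300, 5, "Alu"),      # low mapping quality anchor, removed
    ("1", 90000, 60, None),     # mate does not match the library
    ("2", 5000, 60, "L1"), ("2", 5100, 60, "L1"), ("2", 5200, 60, "L1"),
]
annotated = [("2", 5300, 11300, "L1")]  # reference L1 next to the chr2 cluster
mei_calls = call_insertions(pairs, annotated)
print(mei_calls)
```

The chromosome 2 cluster is dropped because it sits beside an annotated L1, where discordant pairs are expected even without a new insertion. Breakpoint examination is not attempted in this sketch.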
1000 Genomes CEU Trio
Typical human sample ~900-1000 non-reference mobile elements
● ~800 Alu elements, ~100 L1
Why are there 44 calls private to the child?
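One way to count calls private to the child is to look for child calls with no call of the same element type nearby in either parent. The sketch below does this with a position tolerance, since breakpoints are rarely called at exactly the same base in different samples. The function name, tolerance and toy calls are assumptions. Private calls found this way may reflect a missed call in a parent or a false positive in the child rather than a genuine new insertion.

```python
def private_calls(child, mother, father, window=100):
    """Return child calls with no same-family call within `window` bp in either parent.

    Each call is a (chrom, position, family) tuple.
    """
    parents = list(mother) + list(father)

    def seen_in_parent(call):
        chrom, pos, family = call
        return any(c == chrom and f == family and abs(p - pos) <= window
                   for c, p, f in parents)

    return [call for call in child if not seen_in_parent(call)]

# Toy call sets
child = [("1", 1000, "Alu"), ("2", 5000, "L1"), ("3", 700, "Alu")]
mother = [("1", 1030, "Alu")]            # same Alu, breakpoint called 30 bp away
father = [("2", 5000, "Alu")]            # same position but a different family
private = private_calls(child, mother, father)
print(private)
```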
Mobile Element Software
RetroSeq: https://github.com/tk2/RetroSeq
VariationHunter: http://compbio.cs.sfu.ca/strvar.htm
T-LEX: http://petrov.stanford.edu/cgi-bin/Tlex.html
Tea: http://compbio.med.harvard.edu/Tea/

More Related Content

What's hot

Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
Thomas Keane
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
AGRF_Ltd
 
Ngs intro_v6_public
 Ngs intro_v6_public Ngs intro_v6_public
Ngs intro_v6_public
François PAILLIER
 
Kogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysisKogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysis
Junsu Ko
 
NGS Pipeline Preparation - Tools Selection
NGS Pipeline Preparation - Tools SelectionNGS Pipeline Preparation - Tools Selection
NGS Pipeline Preparation - Tools Selection
Minesh A. Jethva
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualize
Ann Loraine
 
NGS Data Preprocessing
NGS Data PreprocessingNGS Data Preprocessing
NGS Data Preprocessing
cursoNGS
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseq
Denis C. Bauer
 
How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)
James Hadfield
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
Nikolay Vyahhi
 
Data analysis pipelines for NGS applications
Data analysis pipelines for NGS applicationsData analysis pipelines for NGS applications
Data analysis pipelines for NGS applications
Vall d'Hebron Institute of Research (VHIR)
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
BITS
 
DEseq, voom and vst
DEseq, voom and vstDEseq, voom and vst
DEseq, voom and vst
Qiang Kou
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
Aureliano Bombarely
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
cursoNGS
 
RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities
Paolo Dametto
 
NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platforms
AllSeq
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-Seq
cursoNGS
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Manikhandan Mudaliar
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
Li Shen
 

What's hot (20)

Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
Ngs intro_v6_public
 Ngs intro_v6_public Ngs intro_v6_public
Ngs intro_v6_public
 
Kogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysisKogo 2013 RNA-seq analysis
Kogo 2013 RNA-seq analysis
 
NGS Pipeline Preparation - Tools Selection
NGS Pipeline Preparation - Tools SelectionNGS Pipeline Preparation - Tools Selection
NGS Pipeline Preparation - Tools Selection
 
wings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualizewings2014 Workshop 1 Design, sequence, align, count, visualize
wings2014 Workshop 1 Design, sequence, align, count, visualize
 
NGS Data Preprocessing
NGS Data PreprocessingNGS Data Preprocessing
NGS Data Preprocessing
 
Transcript detection in RNAseq
Transcript detection in RNAseqTranscript detection in RNAseq
Transcript detection in RNAseq
 
How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)How to cluster and sequence an ngs library (james hadfield160416)
How to cluster and sequence an ngs library (james hadfield160416)
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
Data analysis pipelines for NGS applications
Data analysis pipelines for NGS applicationsData analysis pipelines for NGS applications
Data analysis pipelines for NGS applications
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 
DEseq, voom and vst
DEseq, voom and vstDEseq, voom and vst
DEseq, voom and vst
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
Discovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGSDiscovery and annotation of variants by exome analysis using NGS
Discovery and annotation of variants by exome analysis using NGS
 
RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities RNA sequencing: advances and opportunities
RNA sequencing: advances and opportunities
 
NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platforms
 
Differential expression in RNA-Seq
Differential expression in RNA-SeqDifferential expression in RNA-Seq
Differential expression in RNA-Seq
 
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
Variant (SNP) calling - an introduction (with a worked example, using FreeBay...
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 

Viewers also liked

New Strategy to detect SNPs
New Strategy to detect SNPsNew Strategy to detect SNPs
New Strategy to detect SNPs
Miguel Galves
 
Non-synonymous SNP ID
Non-synonymous SNP IDNon-synonymous SNP ID
Non-synonymous SNP ID
cgstorer
 
Next generation sequencing for snp discovery(final)
Next generation sequencing for snp discovery(final)Next generation sequencing for snp discovery(final)
Next generation sequencing for snp discovery(final)
UAS,GKVK<BANGALORE
 
Moving Towards a Validated High Throughput Sequencing Solution for Human Iden...
Moving Towards a Validated High Throughput Sequencing Solution for Human Iden...Moving Towards a Validated High Throughput Sequencing Solution for Human Iden...
Moving Towards a Validated High Throughput Sequencing Solution for Human Iden...
Thermo Fisher Scientific
 
L11 dna__polymorphisms__mutations_and_genetic_diseases4
L11  dna__polymorphisms__mutations_and_genetic_diseases4L11  dna__polymorphisms__mutations_and_genetic_diseases4
L11 dna__polymorphisms__mutations_and_genetic_diseases4
MUBOSScz
 
Single Nucleotide Polymorphism Analysis (SNPs)
Single Nucleotide Polymorphism Analysis (SNPs)Single Nucleotide Polymorphism Analysis (SNPs)
Single Nucleotide Polymorphism Analysis (SNPs)
Data Science Thailand
 
Snp
SnpSnp
Single nucleotide polymorphism
Single nucleotide polymorphismSingle nucleotide polymorphism
Single nucleotide polymorphism
Bipul Das
 

Viewers also liked (8)

New Strategy to detect SNPs
New Strategy to detect SNPsNew Strategy to detect SNPs
New Strategy to detect SNPs
 
Non-synonymous SNP ID
Non-synonymous SNP IDNon-synonymous SNP ID
Non-synonymous SNP ID
 
Next generation sequencing for snp discovery(final)
Next generation sequencing for snp discovery(final)Next generation sequencing for snp discovery(final)
Next generation sequencing for snp discovery(final)
 
Moving Towards a Validated High Throughput Sequencing Solution for Human Iden...
Moving Towards a Validated High Throughput Sequencing Solution for Human Iden...Moving Towards a Validated High Throughput Sequencing Solution for Human Iden...
Moving Towards a Validated High Throughput Sequencing Solution for Human Iden...
 
L11 dna__polymorphisms__mutations_and_genetic_diseases4
L11  dna__polymorphisms__mutations_and_genetic_diseases4L11  dna__polymorphisms__mutations_and_genetic_diseases4
L11 dna__polymorphisms__mutations_and_genetic_diseases4
 
Single Nucleotide Polymorphism Analysis (SNPs)
Single Nucleotide Polymorphism Analysis (SNPs)Single Nucleotide Polymorphism Analysis (SNPs)
Single Nucleotide Polymorphism Analysis (SNPs)
 
Snp
SnpSnp
Snp
 
Single nucleotide polymorphism
Single nucleotide polymorphismSingle nucleotide polymorphism
Single nucleotide polymorphism
 

Similar to 2014 Wellcome Trust Advances Course: NGS Course - Lecture2

The Clinical Significance of Transcript Alignment Discrepancies … and tools t...
The Clinical Significance of Transcript Alignment Discrepancies … and tools t...The Clinical Significance of Transcript Alignment Discrepancies … and tools t...
The Clinical Significance of Transcript Alignment Discrepancies … and tools t...
Human Variome Project
 
2015 09-29-sbc322-methods.key
2015 09-29-sbc322-methods.key2015 09-29-sbc322-methods.key
2015 09-29-sbc322-methods.key
Yannick Wurm
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencing
Stephen Turner
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
HAMNAHAMNA8
 
Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...
OECD Environment
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research Workflows
Delaina Hawkins
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research Workflows
Golden Helix Inc
 
Browsing Genes, Variation and Regulation data with Ensembl
Browsing Genes, Variation and Regulation data with EnsemblBrowsing Genes, Variation and Regulation data with Ensembl
Browsing Genes, Variation and Regulation data with Ensembl
Denise Carvalho-Silva, PhD
 
Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015
Kim D. Pruitt
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysis
Despoina Kalfakakou
 
Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1
Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1
Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1
QIAGEN
 
NGS Presentation .pptx
NGS Presentation  .pptxNGS Presentation  .pptx
NGS Presentation .pptx
MalihaTanveer1
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEG
Long Pei
 
20140710 3 l_paul_ercc2.0_workshop
20140710 3 l_paul_ercc2.0_workshop20140710 3 l_paul_ercc2.0_workshop
20140710 3 l_paul_ercc2.0_workshop
External RNA Controls Consortium
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GenomeInABottle
 
201007131ghfjjkklllllllll14012254-152438.ppt
201007131ghfjjkklllllllll14012254-152438.ppt201007131ghfjjkklllllllll14012254-152438.ppt
201007131ghfjjkklllllllll14012254-152438.ppt
nimrah farooq
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016
GenomeInABottle
 
Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030
GenomeInABottle
 
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
VHIR Vall d’Hebron Institut de Recerca
 
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
Jonathan Eisen
 

Similar to 2014 Wellcome Trust Advances Course: NGS Course - Lecture2 (20)

The Clinical Significance of Transcript Alignment Discrepancies … and tools t...
The Clinical Significance of Transcript Alignment Discrepancies … and tools t...The Clinical Significance of Transcript Alignment Discrepancies … and tools t...
The Clinical Significance of Transcript Alignment Discrepancies … and tools t...
 
2015 09-29-sbc322-methods.key
2015 09-29-sbc322-methods.key2015 09-29-sbc322-methods.key
2015 09-29-sbc322-methods.key
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencing
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
 
Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research Workflows
 
Using VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research WorkflowsUsing VarSeq to Improve Variant Analysis Research Workflows
Using VarSeq to Improve Variant Analysis Research Workflows
 
Browsing Genes, Variation and Regulation data with Ensembl
Browsing Genes, Variation and Regulation data with EnsemblBrowsing Genes, Variation and Regulation data with Ensembl
Browsing Genes, Variation and Regulation data with Ensembl
 
Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysis
 
Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1
Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1
Digital RNAseq Technology Introduction: Digital RNAseq Webinar Part 1
 
NGS Presentation .pptx
NGS Presentation  .pptxNGS Presentation  .pptx
NGS Presentation .pptx
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEG
 
20140710 3 l_paul_ercc2.0_workshop
20140710 3 l_paul_ercc2.0_workshop20140710 3 l_paul_ercc2.0_workshop
20140710 3 l_paul_ercc2.0_workshop
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
 
201007131ghfjjkklllllllll14012254-152438.ppt
201007131ghfjjkklllllllll14012254-152438.ppt201007131ghfjjkklllllllll14012254-152438.ppt
201007131ghfjjkklllllllll14012254-152438.ppt
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016
 
Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030Genome in a bottle for amp GeT-RM 181030
Genome in a bottle for amp GeT-RM 181030
 
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course -...
 
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
Phylogeny-driven approaches to microbial & microbiome studies: talk by Jonath...
 


Recently uploaded

Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Texas Alliance of Groundwater Districts
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
kejapriya1
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
İsa Badur
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
Anagha Prasad
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
European Sustainable Phosphorus Platform
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
TinyAnderson
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
Aditi Bajpai
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
Sérgio Sacani
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
Daniel Tubbenhauer
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 

Recently uploaded (20)

Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
bordetella pertussis.................................ppt
bordetella pertussis.................................pptbordetella pertussis.................................ppt
bordetella pertussis.................................ppt
 
aziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobelaziz sancar nobel prize winner: from mardin to nobel
aziz sancar nobel prize winner: from mardin to nobel
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
molar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptxmolar-distalization in orthodontics-seminar.pptx
molar-distalization in orthodontics-seminar.pptx
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 

2014 Wellcome Trust Advanced Course: NGS Course - Lecture 2

  • 1. WTAC NGS Course, Hinxton 12th April 2014
    Lecture 2: Identification of SNPs, Indels, and structural variants
    Thomas Keane, Sequence Variation Infrastructure Group, WTSI
    Today's slides: ftp://ftp-mouse.sanger.ac.uk/other/tk2/WTAC-2014/Lecture2.pdf
  • 2. Lecture 2: Identification of SNPs, Indels, and structural variants
      ➢ VCF Format
      ➢ SNP/indel Identification
      ➢ Structural Variation
  • 3. VCF: Variant Call Format
    VCF is a standardised format for storing DNA polymorphism data
      ● SNPs, insertions, deletions and structural variants
      ● With rich annotations (e.g. context, predicted function, sequence data support)
    Indexed for fast data retrieval of variants from a range of positions
    Store variant information across many samples
    Record meta-data about the site
      ● dbSNP accession, filter status, validation status
    Very flexible format
      ● Arbitrary tags can be introduced to describe new types of variants
      ● No two VCF files are necessarily the same
      ● User extensible annotation fields supported
      ● Same event can be expressed in multiple ways by including different numbers
      ● Recommendation on VCF format website to ensure consistency
  • 4. VCF Format
    Header section and a data section
    Header
      ● Arbitrary number of meta-data information lines
      ● Starting with characters ‘##’
      ● Column definition line starts with single ‘#’
    Mandatory columns
      ● Chromosome (CHROM)
      ● Position of the start of the variant (POS)
      ● Unique identifiers of the variant (ID)
      ● Reference allele (REF)
      ● Comma separated list of alternate non-reference alleles (ALT)
      ● Phred-scaled quality score (QUAL)
      ● Site filtering information (FILTER)
      ● User extensible annotation (INFO)
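The header/data split and the eight mandatory columns described on this slide can be illustrated with a minimal Python sketch. It is not a full VCF parser: the function names `parse_info` and `parse_vcf` are hypothetical, the single data record is made up for illustration, and genotype columns are ignored.

```python
def parse_info(info):
    """Turn an INFO string such as 'NS=2;DP=10;DB' into a dict.

    Keys without '=' (flags such as DB) are stored as True.
    """
    out = {}
    if info == ".":
        return out
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_vcf(lines):
    """Split VCF text lines into meta lines, column names and records."""
    meta, columns, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):        # meta-data information line
            meta.append(line)
        elif line.startswith("#"):       # column definition line
            columns = line[1:].split("\t")
        else:                            # data line
            rec = dict(zip(columns, line.split("\t")))
            rec["POS"] = int(rec["POS"])
            rec["ALT"] = rec["ALT"].split(",")   # comma separated alleles
            rec["INFO"] = parse_info(rec["INFO"])
            records.append(rec)
    return meta, columns, records


# Illustrative input only; the record is invented for this sketch.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tNS=2;DP=10;DB",
]
meta, columns, records = parse_vcf(example)
print(records[0]["CHROM"], records[0]["POS"], records[0]["ALT"])
```

For real data a maintained library or VCFtools is the safer choice; the sketch only shows how the pieces of the format fit together.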
  • 5. Example VCF (SNPs/indels)
  • 6. VCF Trivia 1
      ● What version of the human reference genome was used?
      ● What does the DB INFO tag stand for?
      ● What does the ALT column contain?
      ● At position 17330, what is the total depth? What is the depth for sample NA00002?
      ● At position 17330, what is the genotype of NA00002?
      ● Which position is a tri-allelic SNP site?
      ● What sort of variant is at position 1234567? What is the genotype of NA00002?
  • 7. Functional Annotation
    VCF can store arbitrary
      ● INFO tags per site
      ● Genotype FORMAT tags
    Use tags to describe
      ● Genomic context of the variant (e.g. coding, intronic, non-coding, UTR, intergenic)
      ● Predicted functional consequence of the variant (e.g. synonymous/non-synonymous, protein structure change)
      ● Presence of the variant in other large resequencing studies
    Several tools for annotating a VCF
      ● SnpEff: http://snpeff.sourceforge.net/
      ● Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/script/index.html
      ● FunSeq: http://funseq.gersteinlab.org/
  • 8. Ensembl - VEP
    "VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions."
    Species must be included in either Ensembl or Ensembl Genomes
    Sequence Ontology (SO) terms to describe genomic context
    PubMed IDs for variants cited
    Can output only the most severe consequence per variation
    Online or off-line mode
      ● Off-line recommended for large numbers of variants (download relevant cache)
    Human specific annotations
      ● SIFT - predicts whether an amino acid substitution affects protein function
      ● PolyPhen - predicts impact of an amino acid substitution on the structure of human proteins
      ● 1000 Genomes frequencies - global or per population
  • 9. VEP VCF
    VEP INFO tag:
      ● ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF">
    Example (one comma separated entry per affected transcript):
      ● CSQ=T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant||||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17,
        T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,
        T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17
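The CSQ value on this slide can be unpacked by splitting on commas (one entry per transcript) and then on `|`, using the field order given in the header's Format string. The sketch below does that in plain Python; `parse_csq` is a hypothetical helper name, and the field list is copied from the slide rather than read from a file header as a real script would do.

```python
# Field order copied from the "Format:" part of the CSQ header line above.
CSQ_FORMAT = (
    "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|"
    "CDS_position|Protein_position|Amino_acids|Codons|"
    "Existing_variation|AA_MAF|EA_MAF|DISTANCE|STRAND|CLIN_SIG|"
    "SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF"
)


def parse_csq(csq_value, fmt=CSQ_FORMAT):
    """Return one dict per transcript entry in a CSQ INFO value."""
    keys = fmt.split("|")
    entries = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        if len(values) != len(keys):
            raise ValueError(
                "expected %d fields, got %d" % (len(keys), len(values))
            )
        entries.append(dict(zip(keys, values)))
    return entries


# Two of the three entries from the slide's example.
csq = (
    "T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|"
    "34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,"
    "T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||"
    "rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17"
)
for entry in parse_csq(csq):
    print(entry["SYMBOL"], entry["Feature"], entry["Consequence"])
```

Empty strings mark fields VEP left blank for that transcript, for example the protein position of a UTR variant.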
  • 10. More Information
    VCF
      ● http://bioinformatics.oxfordjournals.org/content/27/15/2156.full
      ● http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
    VCFtools
      ● http://vcftools.sourceforge.net
    GATK
      ● http://www.broadinstitute.org/gatk/
      ● http://www.broadinstitute.org/gatk/guide/article?id=1268
    VCF Annotation
      ● Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/index.html
      ● SnpEff: http://snpeff.sourceforge.net/
      ● Anntools: http://anntools.sourceforge.net/
  • 11. Lecture 2: Identification of SNPs, Indels, and structural variants
      ➢ VCF Format
      ➢ SNP/indel Identification
      ➢ Structural Variation
  • 12. SNP Identification
    SNP: single nucleotide polymorphism
      ● Examine the bases aligned to a position and look for differences
    SNP discovery vs genotyping
      ● Discovery: finding new variant sites
      ● Genotyping: determining the genotype at a set of already known sites
    Factors to consider when calling SNPs
      ● Base call qualities of each supporting base
      ● Proximity to
        ○ Small indel
        ○ Homopolymer run (>4-5bp for 454 and >10bp for Illumina)
      ● Mapping qualities of the reads supporting the SNP
        ○ Low mapping quality indicates repetitive sequence
      ● Read length
        ○ Longer reads can be aligned with high confidence to a larger portion of the genome
      ● Paired reads
      ● Sequencing depth
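As a toy illustration of how base quality, mapping quality and depth enter a site-based call, the sketch below filters the bases piled up at one position and reports the most common non-reference base. It is a deliberately naive counting rule: the function name and every threshold are assumptions made for this example, and production callers use probabilistic genotype likelihoods instead.

```python
from collections import Counter


def naive_site_call(ref_base, pileup, min_baseq=20, min_mapq=30,
                    min_depth=8, min_alt_frac=0.2):
    """Return (alt_base, alt_count, depth) or None for one position.

    pileup is a list of (base, base_quality, mapping_quality) tuples.
    All thresholds are illustrative defaults, not recommended values.
    """
    # Discard bases with a poor base call or from poorly mapped reads.
    usable = [b for b, bq, mq in pileup if bq >= min_baseq and mq >= min_mapq]
    depth = len(usable)
    if depth < min_depth:
        return None  # too little evidence to say anything
    counts = Counter(usable)
    alts = {b: n for b, n in counts.items()
            if b != ref_base and n / depth >= min_alt_frac}
    if not alts:
        return None
    alt = max(alts, key=alts.get)
    return alt, alts[alt], depth


# Made-up pileup: 6 good A, 5 good G, 3 low-quality G, 2 T from MAPQ 0 reads.
pileup = ([("A", 30, 60)] * 6 + [("G", 35, 60)] * 5 +
          [("G", 5, 60)] * 3 + [("T", 30, 0)] * 2)
print(naive_site_call("A", pileup))
```

The low-quality G bases and the T bases from reads with mapping quality 0 never reach the count, which is the effect the quality factors on the slide are meant to have.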
  • 13. Mouse SNP
  • 14. Read Length vs. Uniqueness
  • 15. Inaccessible Genome
  • 16. Is this a real SNP?
  • 17. Evaluating SNPs
    Specificity vs sensitivity
      ● False positives vs. false negatives
    Desirable to have high sensitivity and specificity
    Sensitivity
      ● External sources of validation
    Specificity
      ● Test a random selection of SNPs by another technology
      ● e.g. Sequenom, Sanger sequencing…
    Receiver operating characteristic (ROC) curves to investigate effects of varying parameters
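The quantities on this slide follow the standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below computes both and traces ROC points over a quality threshold. The site list, the thresholds and the function names are invented for illustration; in practice the truth labels come from validation data such as the external sources mentioned above.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real variants that were called."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-variant sites that were correctly not called."""
    return true_neg / (true_neg + false_pos)


def roc_points(scored_sites, thresholds):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples.

    scored_sites is a list of (quality, is_real) pairs; a site counts as
    called when its quality is at least the threshold.
    """
    points = []
    for t in thresholds:
        tp = sum(1 for q, real in scored_sites if q >= t and real)
        fp = sum(1 for q, real in scored_sites if q >= t and not real)
        fn = sum(1 for q, real in scored_sites if q < t and real)
        tn = sum(1 for q, real in scored_sites if q < t and not real)
        points.append((t, 1.0 - specificity(tn, fp), sensitivity(tp, fn)))
    return points


# Made-up candidate sites: (call quality, confirmed by validation?)
sites = [(90, True), (80, True), (70, False),
         (60, True), (30, False), (20, False)]
for threshold, fpr, tpr in roc_points(sites, [25, 50, 75]):
    print(threshold, round(fpr, 2), round(tpr, 2))
```

Raising the threshold removes false positives first and then starts to cost true positives, which is the trade-off a ROC curve makes visible.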
  • 18. Known Systematic Biases
    Biases can be introduced during sample preparation, sequencing, computational alignment, etc.
      ● Can generate false positive SNPs/indels
    Potential biases
      ● Strand bias
      ● End distance bias
      ● Consistency across replicates/libraries
      ● Variant distance bias
    VCFtools
      ● Soft filter the variants file for these biases
      ● Variants are kept in the file and only annotated with the potential bias affecting them
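One common way to test for strand bias is a Fisher exact test on the 2x2 table of reference/alternate allele counts on the forward/reverse strand. The sketch below implements the two-sided test with the standard library only. Whether the filter referred to on the slide uses exactly this test is an assumption; the counts and the 0.01 cut-off are made up, and the `StrandBias` filter label is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Here: a = ref forward, b = ref reverse, c = alt forward, d = alt reverse.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x ref reads on the forward strand.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


def soft_filter(filters, p_value, cutoff=0.01):
    """Annotate rather than remove: add a label to the FILTER values."""
    return filters + ["StrandBias"] if p_value < cutoff else filters


balanced = fisher_exact_two_sided(10, 10, 10, 10)  # alt reads on both strands
skewed = fisher_exact_two_sided(20, 20, 15, 0)     # alt reads only on forward
print(balanced, skewed, soft_filter([], skewed))
```

The soft-filter helper mirrors the slide's point that flagged variants stay in the file with an annotation instead of being deleted.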
  • 19. Strand Bias
  • 20. End Distance Bias
  • 21. Variant Distance Bias
  • 22. Reproducibility
  • 23. Future of Variant Calling?
    Current approaches
      ● Rely heavily on the supplied alignment
      ● Largely site based, don't examine local haplotype
    Local de novo assembly based variant callers
      ● Call SNPs, indels, MNPs and small SVs simultaneously
      ● Can remove mapping artifacts
      ● e.g. GATK HaplotypeCaller
  • 24. Haplotype Based Calling - GATK
  • 25. Lecture 2: Identification of SNPs, Indels, and structural variants
      ➢ VCF Format
      ➢ SNP/indel Identification
      ➢ Structural Variation
  • 26. Genomic Structural Variation
    Large DNA rearrangements (>100bp)
    Frequent causes of disease
      ● Referred to as genomic disorders
      ● Mendelian diseases or complex traits such as behaviors
      ● E.g. increase in gene dosage due to increase in copy number
      ● Prevalent in cancer genomes
    Many types of genomic structural variation (SV)
      ● Insertions, deletions, copy number changes, inversions, translocations & complex events
    Comparative genomic hybridization (CGH) traditionally used for copy number discovery
      ● CNVs of 1-50 kb in size have been under-ascertained
    Next-gen sequencing revolutionised the field of SV discovery
      ● Parallel sequencing of ends of large numbers of DNA fragments
      ● Examine alignment distance of reads to discover presence of genomic rearrangements
      ● Resolution down to ~100bp
  • 27. Human Disease
    Stankiewicz and Lupski (2010) Ann. Rev. Med.
  • 28. Structural Variation
    Several types of structural variations (SVs)
      ● Large insertions/deletions
      ● Inversions
      ● Translocations
    Read pair information used to detect these events
      ● Paired end sequencing of either end of DNA fragment
      ● Observe deviations from the expected fragment size
      ● Presence/absence of mate pairs
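The read pair signals listed on this slide can be turned into a simple per-pair classifier. The sketch below labels a pair by comparing its mapped distance with the library's expected fragment size and by checking mate chromosome and orientation. The mapping of signals to event types follows the usual paired-end interpretation; the function name, the three standard deviation cut-off and the example numbers are assumptions for illustration, and real callers require clusters of supporting pairs before reporting anything.

```python
def classify_pair(chrom1, chrom2, same_strand, insert_size,
                  mean_insert, sd_insert, n_sd=3.0):
    """Label one read pair with the SV type it would support, if any.

    mean_insert and sd_insert describe the library's fragment size
    distribution; n_sd is an illustrative cut-off, not a recommended value.
    """
    if chrom1 != chrom2:
        return "translocation_candidate"   # mates on different chromosomes
    if same_strand:
        return "inversion_candidate"       # unexpected mate orientation
    if insert_size > mean_insert + n_sd * sd_insert:
        return "deletion_candidate"        # mates map too far apart
    if insert_size < mean_insert - n_sd * sd_insert:
        return "insertion_candidate"       # mates map too close together
    return "concordant"


# Made-up library: 400bp fragments with a standard deviation of 30bp.
print(classify_pair("1", "1", False, 410, 400, 30))
print(classify_pair("1", "1", False, 2500, 400, 30))
print(classify_pair("1", "1", False, 150, 400, 30))
print(classify_pair("1", "1", True, 400, 400, 30))
print(classify_pair("1", "7", False, 0, 400, 30))
```

A tight fragment size distribution matters here: the wider the library's spread, the larger an event must be before a pair stands out from the concordant ones.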
  • 29. Structural Variation Types
  • 30. Fragment Size QC
  • 31. What is this?
  • 32. What is this?
  • 33. What is this?
  • 34. Mobile Element Insertions
    Transposons are segments of DNA that can move within the genome
      ● A minimal ‘genome’ - ability to replicate and change location
      ● Relics of ancient viral infections
    Dominate landscape of mammalian genomes
      ● 38-45% of rodent and primate genomes
      ● Genome size proportional to number of TEs
    Class 1 (RNA intermediate) and 2 (DNA intermediate)
    Potent genetic mutagens
      ● Disrupt expression of genes
      ● Genome reorganisation and evolution
      ● Transduction of flanking sequence
    Species specific families
      ● Human: Alu, L1, SVA
      ● Mouse: SINE, LINE, ERV
    Many other families in other species
  • 35. Human Mobile Elements
  • 36. Mobile Element Insertions
  • 37. Mouse Example - LookSeq
  • 38. Human Alu - IGV
  • 39. Detecting Mobile Element Insertions
    Most algorithms for locating non-reference mobile elements operate in a similar manner
    Goal: Detect all read pairs where one end is flanking the insertion point and the mate is in the inserted sequence
    Pseudo algorithm
      ● Read through the BAM file and make a list of all discordant read pairs
      ● Keep the pairs where one end is similar to a sequence in your library of mobile elements
      ● Remove anchor reads with low mapping quality
      ● Cluster the anchor reads and examine the breakpoint
      ● Filter out any clusters close to annotated elements of the same type
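The pseudo algorithm above can be sketched on in-memory records. The example below assumes the first two steps have already happened: each anchor read carries the mobile element family its mate matched, or None if it matched nothing. The `Anchor` record, the function name, all thresholds and the example data are hypothetical; this is not the RetroSeq implementation, and breakpoint refinement is left out.

```python
from collections import namedtuple

# One anchor read of a discordant pair; mate_family is the mobile element
# family the mate was similar to, or None if it matched nothing.
Anchor = namedtuple("Anchor", "chrom pos mapq mate_family")


def call_insertions(anchors, annotated, min_mapq=30, window=500,
                    min_reads=2, exclude=100):
    """Cluster anchors into candidate non-reference insertions.

    annotated is a list of (chrom, start, end, family) reference elements.
    Returns (chrom, first_pos, last_pos, family, n_reads) tuples.
    """
    # Keep anchors whose mate hit the element library and that map well.
    kept = sorted(
        (a for a in anchors if a.mate_family and a.mapq >= min_mapq),
        key=lambda a: (a.chrom, a.mate_family, a.pos),
    )
    # Group anchors of the same family that lie within `window` bp.
    clusters = []
    for a in kept:
        last = clusters[-1] if clusters else None
        if (last and last[-1].chrom == a.chrom
                and last[-1].mate_family == a.mate_family
                and a.pos - last[-1].pos <= window):
            last.append(a)
        else:
            clusters.append([a])
    calls = []
    for c in clusters:
        if len(c) < min_reads:
            continue
        chrom, fam = c[0].chrom, c[0].mate_family
        start, end = c[0].pos, c[-1].pos
        # Drop clusters next to an annotated element of the same type.
        near_known = any(
            k_chrom == chrom and k_fam == fam
            and k_start - exclude <= end and start <= k_end + exclude
            for k_chrom, k_start, k_end, k_fam in annotated
        )
        if not near_known:
            calls.append((chrom, start, end, fam, len(c)))
    return calls


anchors = [
    Anchor("1", 1000, 60, "Alu"), Anchor("1", 1100, 60, "Alu"),
    Anchor("1", 1150, 10, "Alu"),   # low mapping quality, removed
    Anchor("1", 5000, 60, "L1"), Anchor("1", 5100, 60, "L1"),
    Anchor("2", 300, 60, None),     # mate did not match the library
]
annotated = [("1", 4900, 5200, "L1")]
print(call_insertions(anchors, annotated))
```

The L1 cluster is discarded because it overlaps a reference L1, which is the usual source of false calls when reads from an annotated element are mistaken for a new insertion.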
  • 40. 1000 Genomes CEU Trio
    Typical human sample: ~900-1000 non-reference mobile elements
      ● ~800 Alu elements, ~100 L1
    Why are there 44 calls private to the child?
  • 41. Mobile Element Software
      ● RetroSeq: https://github.com/tk2/RetroSeq
      ● VariationHunter: http://compbio.cs.sfu.ca/strvar.htm
      ● T-LEX: http://petrov.stanford.edu/cgi-bin/Tlex.html
      ● Tea: http://compbio.med.harvard.edu/Tea/