This document provides an overview of downstream analyses that can be performed after variant identification and filtering in a typical variant calling pipeline. It discusses visualization of variant data in each gene to identify potential causative variants. It also mentions association studies as another type of downstream analysis where variants are tested for association with disease phenotypes. The goal of downstream analyses is to help prioritize variants for further investigation.
Course: Bioinformatics for Biomedical Research (2014).
Session: 2.3- Introduction to NGS Variant Calling Analysis.
Statistics and Bioinformatisc Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Next Generation Sequencing and its Applications in Medical Research - Frances...Sri Ambati
The so-called “next-generation” sequencing (NGS) technologies allows us, in a short time and in parallel, to sequence massive amounts of DNA, overcoming the limitations of the original Sanger sequencing methods used to sequence the first human genome. NGS technologies have had an enormous impact on biomedical research within a short time frame. This talk will give an overview of these applications with specific examples from Mendelian genomics and cancer research. #h2ony
AGRF in conjunction with EMBL Australia recently organised a workshop at Monash University Clayton. This workshop was targeted at beginners and biologists who are new to analysing Next-Gen Sequencing data. The workshop also aimed to provide users with a snapshot of bioinformatics and data analysis tips on how to begin to analyse project data. An introduction to RNA-seq data analysis was presented by AGRF Senior Bioinformatician Dr. Sonika Tyagi.
Presented: 1st August 2012
Course: Bioinformatics for Biomedical Research (2014).
Session: 2.3- Introduction to NGS Variant Calling Analysis.
Statistics and Bioinformatisc Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
Next Generation Sequencing and its Applications in Medical Research - Frances...Sri Ambati
The so-called “next-generation” sequencing (NGS) technologies allows us, in a short time and in parallel, to sequence massive amounts of DNA, overcoming the limitations of the original Sanger sequencing methods used to sequence the first human genome. NGS technologies have had an enormous impact on biomedical research within a short time frame. This talk will give an overview of these applications with specific examples from Mendelian genomics and cancer research. #h2ony
AGRF in conjunction with EMBL Australia recently organised a workshop at Monash University Clayton. This workshop was targeted at beginners and biologists who are new to analysing Next-Gen Sequencing data. The workshop also aimed to provide users with a snapshot of bioinformatics and data analysis tips on how to begin to analyse project data. An introduction to RNA-seq data analysis was presented by AGRF Senior Bioinformatician Dr. Sonika Tyagi.
Presented: 1st August 2012
An introduction to the tools and methods used for the bioinformatics analysis of ChIP-Seq data.
Written and delivered for the "Epigenetics and its applications in clinical research" course at the Karolinska Institute in Stockholm, Sweden.
Presentation by Scott Woodman, MD, PhD. Presented at the 2018 Eyes on a Cure: Patient & Caregiver Symposium, hosted by the Melanoma Research Foundation's CURE OM initiative.
DNA Methylation: An Essential Element in Epigenetics Facts and TechnologiesQIAGEN
Check out this slide deck from Dr. Thorsten Singer and Dr. Ralf Peist to learn about DNA methylation in epigenetics, from its significance in cancer to strategies for studying it.
complete Single Nucleotide Polymorphiitsm Detection methods with Advance techniques with its applications
Single nucleotide polymorphisms are single base variations between genomes within a species.
There are at least 10 million polymorphic sites in the human genome.
SNPs can distinguish individuals from one another
Denaturing Gradient Gel Electrophoresis
Chemical Cleavage Of Mismatch
Single-stranded Conformation Polymorphism (SSCP)
MutS Protein-binding Assays
Mismatch Repair Detection (MRD)
Heteroduplex Analysis (HA)
Denaturing High Performance Liquid Chromatography (DHPLC)
UNG-Mediated T-Sequencing
RNA-Mediated Finger printing with MALDI MS Detection
Sequencing by Hybridization
Direct DNA Sequencing
Single-feature polymorphism (SFP)
Invader probe
Allele-specific oligonucleotide probes
PCR-based methods
Allele specific primers
Sequence Polymorphism-Derived (SPD) markers
Targeting induced local lesions in genomes (TILLinG)
Minisequencing primers
Allele-specific ligation probes
Knowing Your NGS Upstream: Alignment and VariantsGolden Helix Inc
Alignment algorithms are not just about placing reads in best-matching locations to a reference genome. They are now being expected to handle small insertions, deletions, gapped alignment of reads across intron boundaries and even span breakpoints of structural variations, fusions and copy number changes. At the same time, variant-calling algorithms can only reach their full potential by being intimately matched to the aligner's output or by doing local assemblies themselves. Knowing when these tools can be expected to perform well and when they will produce technical artifacts or be incapable of detecting features is critical when interpreting any analysis based on their output.
This presentation will compare the performance of the alignment and variant calling tools used by sequencing service providers including Illumina Genome Network, Complete Genomics and The Broad Institute. Using public samples analyzed by each pipeline, we will look at the level of concordance and dive into investigating problematic variants and regions of the genome.
Presentation carried out by Sergi Beltran Agulló, from the CNAG, at the course: Identification and analysis of sequence variants in sequencing projects: fundamentals and tools .
Next Generation Sequencing (NGS) Is A Modern And Cost Effective Sequencing Technology Which Enables Scientists To Sequence Nucleic Acids At Much Faster Rate. In This Presentation, You Will Learn About What is NGS, Idea Behind NGS, Methodology And Protocol, Widely Adapted NGS Protocols, Applications And References For Further Study.
RNA Sequence data analysis,Transcriptome sequencing, Sequencing steady state RNA in a sample is known as RNA-Seq. It is free of limitations such as prior knowledge about the organism is not required.
RNA-Seq is useful to unravel inaccessible complexities of transcriptomics such as finding novel transcripts and isoforms.
Data set produced is large and complex; interpretation is not straight forward.
An introduction to the tools and methods used for the bioinformatics analysis of ChIP-Seq data.
Written and delivered for the "Epigenetics and its applications in clinical research" course at the Karolinska Institute in Stockholm, Sweden.
Presentation by Scott Woodman, MD, PhD. Presented at the 2018 Eyes on a Cure: Patient & Caregiver Symposium, hosted by the Melanoma Research Foundation's CURE OM initiative.
DNA Methylation: An Essential Element in Epigenetics Facts and TechnologiesQIAGEN
Check out this slide deck from Dr. Thorsten Singer and Dr. Ralf Peist to learn about DNA methylation in epigenetics, from its significance in cancer to strategies for studying it.
complete Single Nucleotide Polymorphiitsm Detection methods with Advance techniques with its applications
Single nucleotide polymorphisms are single base variations between genomes within a species.
There are at least 10 million polymorphic sites in the human genome.
SNPs can distinguish individuals from one another
Denaturing Gradient Gel Electrophoresis
Chemical Cleavage Of Mismatch
Single-stranded Conformation Polymorphism (SSCP)
MutS Protein-binding Assays
Mismatch Repair Detection (MRD)
Heteroduplex Analysis (HA)
Denaturing High Performance Liquid Chromatography (DHPLC)
UNG-Mediated T-Sequencing
RNA-Mediated Finger printing with MALDI MS Detection
Sequencing by Hybridization
Direct DNA Sequencing
Single-feature polymorphism (SFP)
Invader probe
Allele-specific oligonucleotide probes
PCR-based methods
Allele specific primers
Sequence Polymorphism-Derived (SPD) markers
Targeting induced local lesions in genomes (TILLinG)
Minisequencing primers
Allele-specific ligation probes
Knowing Your NGS Upstream: Alignment and VariantsGolden Helix Inc
Alignment algorithms are not just about placing reads in best-matching locations to a reference genome. They are now being expected to handle small insertions, deletions, gapped alignment of reads across intron boundaries and even span breakpoints of structural variations, fusions and copy number changes. At the same time, variant-calling algorithms can only reach their full potential by being intimately matched to the aligner's output or by doing local assemblies themselves. Knowing when these tools can be expected to perform well and when they will produce technical artifacts or be incapable of detecting features is critical when interpreting any analysis based on their output.
This presentation will compare the performance of the alignment and variant calling tools used by sequencing service providers including Illumina Genome Network, Complete Genomics and The Broad Institute. Using public samples analyzed by each pipeline, we will look at the level of concordance and dive into investigating problematic variants and regions of the genome.
Presentation carried out by Sergi Beltran Agulló, from the CNAG, at the course: Identification and analysis of sequence variants in sequencing projects: fundamentals and tools .
Next Generation Sequencing (NGS) Is A Modern And Cost Effective Sequencing Technology Which Enables Scientists To Sequence Nucleic Acids At Much Faster Rate. In This Presentation, You Will Learn About What is NGS, Idea Behind NGS, Methodology And Protocol, Widely Adapted NGS Protocols, Applications And References For Further Study.
RNA Sequence data analysis,Transcriptome sequencing, Sequencing steady state RNA in a sample is known as RNA-Seq. It is free of limitations such as prior knowledge about the organism is not required.
RNA-Seq is useful to unravel inaccessible complexities of transcriptomics such as finding novel transcripts and isoforms.
Data set produced is large and complex; interpretation is not straight forward.
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference Databasenist-spin
"Development of FDA MicroDB: A Regulatory-Grade
Microbial Reference Database" presentation at the Standards for Pathogen Identification via NGS (SPIN) workshop hosted by National Institute for Standards and Technology October 2014 by Heike Sichtig, PhD from the FDA and Luke Tallon from IGS UMSOM.
Development of FDA MicroDB: A Regulatory-Grade Microbial Reference DatabaseNathan Olson
"Development of FDA MicroDB: A Regulatory-Grade
Microbial Reference Database" presentation at the Standards for Pathogen Identification via NGS (SPIN) workshop hosted by the National Institute for Standards and Technology October 2014 by Heike Sichtig, PhD from the FDA and Luke Tallon from IGS UMSOM.
Apollo is a web-based application that supports and enables collaborative genome curation in real time, allowing teams of curators to improve on existing automated gene models through an intuitive interface. Apollo allows researchers to break down large amounts of data into manageable portions to mobilize groups of researchers with shared interests.
The i5K, an initiative to sequence the genomes of 5,000 insect and related arthropod species, is a broad and inclusive effort that seeks to involve scientists from around the world in their genome curation process, and Apollo is serving as the platform to empower this community.
This presentation is an introduction to Apollo for the members of the i5K Pilot Project working on species of the order Hemiptera.
VarSeq 2.4.0: VSClinical ACMG Workflow from the User PerspectiveGolden Helix
Earlier this year, we released VarSeq 2.3.0 which brought massive updates to our VSClinical AMP interface, such as enhanced capabilities for automation and analysis of structural variants in the cancer context. Naturally, we wanted to follow that up shortly with similar advancements to our VSClinical ACMG interface, and also make our customers doing germline variant analysis happy.
Our latest software release, VarSeq 2.4.0, was therefore focused on the advancements in VSClinical ACMG, namely support for importing and clinically evaluating structural variants, long read sequencing, advanced automation with evaluation scripts in VSClinical ACMG and end-to-end automation of ACMG workflows with VSPipeline. These new and improved features were discussed in a great webcast by our VP of Product and Engineering, Gabe Rudy, last month.
This upcoming webcast by our FAS team will be a user’s perspective on the new features in VarSeq 2.4.0 and VSClinical ACMG and how our tools can precisely and efficiently enable the full spectrum NGS analysis for Mendelian disorders.
VarSeq 2.4.0: VSClinical ACMG Workflow from the User PerspectiveGolden Helix
Earlier this year, we released VarSeq 2.3.0 which brought massive updates to our VSClinical AMP interface, such as enhanced capabilities for automation and analysis of structural variants in the cancer context. Naturally, we wanted to follow that up shortly with similar advancements to our VSClinical ACMG interface, and also make our customers doing germline variant analysis happy.
Our latest software release, VarSeq 2.4.0, was therefore focused on the advancements in VSClinical ACMG, namely support for importing and clinically evaluating structural variants, long read sequencing, advanced automation with evaluation scripts in VSClinical ACMG and end-to-end automation of ACMG workflows with VSPipeline. These new and improved features were discussed in a great webcast by our VP of Product and Engineering, Gabe Rudy, last month.
This upcoming webcast by our FAS team will be a user’s perspective on the new features in VarSeq 2.4.0 and VSClinical ACMG and how our tools can precisely and efficiently enable the full spectrum NGS analysis for Mendelian disorders.
Presentation by Justin Zook at GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on benchmarks for indels and structural variants.
Overview of the commonly used sequencing platforms, bioinformatic search tool...OECD Environment
24 June 2019: This OECD seminar presented and discussed the potential use of genome sequence, bioinformatic tools and databases in a regulatory decision process for microbial pesticides.
An introduction to Web Apollo for the Biomphalaria glabatra research community.Monica Munoz-Torres
Web Apollo is a web-based, collaborative genomic annotation editing platform. We need annotation editing tools to modify and refine precise location and structure of the genome elements that predictive algorithms cannot yet resolve automatically.
This presentation is an introduction to how the manual annotation process takes place using Web Apollo. It is addressed to the members of the Biomphalaria glabatra research community.
Richard's entangled aventures in wonderlandRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.Sérgio Sacani
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
Nutraceutical market, scope and growth: Herbal drug technologyLokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest
imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters
spanning 0.4−0.9µm) and novel JWST images with 14 filters spanning 0.8−5µm, including 7 mediumband filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data
at > 2.3µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and
30.3-31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric
redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts
z = 11.5 − 15. These objects show compact half-light radii of R1/2 ∼ 50 − 200pc, stellar masses of
M⋆ ∼ 107−108M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr−1
. Our search finds no candidates
at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to
infer the properties of the evolving luminosity function without binning in redshift or luminosity that
marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the
impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results,
and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5
from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical
models for evolution of the dark matter halo mass function.
Cancer cell metabolism: special Reference to Lactate PathwayAADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cell utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Kreb's cycle. The Kreb's cycle allows cells to “burn” the pyruvates made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Kreb's - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
introduction to WARBERG PHENOMENA:
WARBURG EFFECT Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose than do normal cells from outside.
Otto Heinrich Warburg (; 8 October 1883 – 1 August 1970) In 1931 was awarded the Nobel Prize in Physiology for his "discovery of the nature and mode of action of the respiratory enzyme.
WARNBURG EFFECT : cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Professional air quality monitoring systems provide immediate, on-site data for analysis, compliance, and decision-making.
Monitor common gases, weather parameters, particulates.
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
1. February 4th, 2015
Mariam Quiñones, PhD
Computational Molecular Biology Specialist
Bioinformatics and Computational Biosciences Branch
OCICB/OSMO/OD/NIAID/NIH
2. BCBB: A Branch Devoted to Bioinformatics and
Computational Biosciences
Researchers’ time is increasingly important
BCBB saves our collaborators time and effort
Researchers speed projects to completion using
BCBB consultation and development services
No need to hire extra post docs or use external
consultants or developers
What makes us different?
Software DevelopersComputational
Biologists
PMs and Analysts
2
3. BCBB
“NIH Users: Access a menu of
BCBB services on the NIAID
Intranet:
• http://bioinformatics.niaid.nih.gov/
Outside of NIH –
• search “BCBB” on the NIAID
Public Internet Page:
www.niaid.nih.gov
Email us at:
• ScienceApps@niaid.nih.gov
3
4. Remember Sanger?
• Sanger introduced the “dideoxy method” (also known as Sanger
sequencing) in December 1977
Alignment of reads using tools such as ‘Sequencher’
IMAGE: http://www.lifetechnologies.com
4
6. File formats for aligned reads
SAM (sequence alignment map)
CTTGGGCTGCGTCGTGTCTTCGCTTCACACCCGCGACGAGCGCGGCTTCT
CTTGGGCTGCGTCGTGTCTTCGCTTCACACC
Chr_ start end
Chr2 100000 100050
Chr_ start end
Chr2 50000 50050
6
7. Most commonly used alignment file formats
SAM (sequence alignment map)
Unified format for storing alignments to a reference genome
BAM (binary version of SAM) – used commonly to deliver data
Compressed SAM file, is normally indexed
BED
Commonly used to report features described by chrom, start, end, name,
score, and strand.
For example:
chr1 11873 14409 uc001aaa.3 0 +
7
8. Variant Analysis – Class Topics
Introduction
Variant calling,
filtering and
annotation
Overview of
downstream
analyses
8
9. Efforts at creating databases of variants:
• Project that started on 2002 with the goal of describing patterns of human
genetic variation and create a haplotype map using SNPs present in at
least 1% of the population, which were deposited in dbSNPs.
1000 Genomes
• Started in 2008 with a goal of using at least 1000 individuals (about
2,500 samples at 4X coverage), interrogate 1000 gene regions in 900
samples (exome analysis), find most genetic variants with allele
frequencies above 1% and to a 0.1% if in coding regions as well as
Indels and structural variants
• Make data available to the public ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/
or via Amazon Cloud http://s3.amazonaws.com/1000genomes
9
10. Variants are not only found in coding sequences
Main Goal:
• Find all functional elements in the genome
10
11. Read
alignment
• Mapping reads to reference
genome, processing to improve
alignment and mark PCR duplicates
Variant
identification
• Variant calling and
genotyping
Annotation
and Filtering
• Removing likely false
positive variants
Visualization
and
downstream
analyses
• (e.g.
association
studies)
A typical variant calling pipeline
11
12. Mapping reads to reference
Whole Genome Exome-Seq
Useful for identifying
SNPs but also structural
variants in any genomic
region.
Interrogates variants in
specific genomic regions,
usually coding sequences.
Cheaper and provides high
coverage
Read
alignment
Recommended alignment tools: BWA-MEM, Bowtie2, ssaha2, novoalign.
See more tools:
• Seq Answers Forum: http://seqanswers.com/wiki/Software/list
12
13. How to identify structural variants?
http://www.sciencedirect.com/science/article/pii/S0168952511001685
Methods and SV tools
1- Read Pair – use pair ends.
GASVPro (Sindi 2012),
BreakDancer (Chen 2009)
2- Read depth – CNVnator
(Abyzov, 2010)
3- Split read – PinDEL (Ye
2009)
4- Denovo assembly
Learn more:
http://bit.ly/16A8FKR
Also see references:
http://bit.ly/1z7xnol - SV
http://bit.ly/qin294 - CNV
http://bit.ly/rLB7jc
http://bit.ly/17fdZTc
13
14. Exome-Seq
Targeted exome capture
targets ~20,000 variants
near coding sequences
and a few rare missense or
loss of function variants
Provides high depth of
coverage for more accurate
variant calling
It is starting to be used as a
diagnostic tool
Ann Neurol. 2012 Jan;71(1):5-14.
-Nimblegen
-Agilent
-Illumina
14
16. Mapping
subject and
family
sequence
data with
BWA
(paired end)
Variants
called with
SAMtools
Filtering out
common
variants in 1000
genomes (>0.05
allele freq)
500K variants
Annotation
using
ANNOVAR*
576 non-
synonymous
filtered
variants
Visualization
of each gene
Whole-exome sequencing reveals a heterozygous LRP5 mutation in a 6-
year-old boy with vertebral compression fractures and low trabecular bone
density Fahiminiya S, Majewski J, Roughley P, Roschger P, Klaushofer K, Rauch F. Bone: July 23, 2013
*Annovar annotated: the type of mutations, presence in
dbSNP132, minor allele frequency in the 1000 Genomes
project, SIFT, Polyphen-2 and PHASTCONS scores
16
17. Exome-seq
- Also used for finding somatic mutations
Exome sequencing of serous endometrial tumors identifies recurrent somatic
mutations in chromatin- remodeling and ubiquitin ligase complex genes
VOLUME 44 | NUMBER 12 | DECEMBER 2012 Nature Genetics
Figure 1: Somatic mutations
in CHD4, FBXW7 and SPOP
cluster within important
functional domains of the
encoded proteins.
17
18. Variant Analysis – Class Topics
Introduction
Variant
calling,
filtering and
annotation
Overview of
downstream
analyses
18
19. Variant
identification
Goals:
• Identify variant bases, genotype likelihood and
allele frequency while avoiding instrument noise.
• Generate an output file in vcf format (variant call
format) with genotypes assigned to each sample
19
20. Challenges of finding variants using NGS
data
Base calling errors
• Different types of errors that vary by technology, sequence
cycle and sequence context
Low coverage sequencing
• Lack of sequence from two chromosomes of a diploid
individual at a site
Inaccurate mapping
• Aligned reads should be reported with mapping quality score
20
22. Processing reads for Variant Analysis
Trimming:
• Unless quality of read data is poor, low quality end trimming of short
reads (e.g. Illumina) prior to mapping is usually not required.
• If adapters are present in a high number of reads, these could be
trimmed to improve mapping.
– Recommended tools: Prinseq, Btrim64, FastQC
Marking duplicates:
• It is not required to remove duplicate reads prior to mapping but
instead it is recommended to mark duplicates after the alignment.
– Recommended tool: Picard.
• A library that is composed mainly of PCR duplicates could produce
inaccurate variant calling.
http://www.biomedcentral.com/1471-2164/13/S8/S8
Clip: http://bestclipartblog.com 22
24. SNP Calling Methods
Early SNP callers and some commercial packages use a simple method
of counting reads for each allele that have passed a mapping quality
threshold. ** This is not good enough in particular when coverage is
low.
It is best to use advanced SNP callers which add more statistics for more
accurate variant calling of low coverage datasets and indels.
-e.g. GATK
24
25. SNP and Genotype calling
http://bib.oxfordjournals.org/content/early/2013/01/21/bib.bbs086/T1.expansion.html
Germline Variant Callers
Somatic Variant Callers
• Varscan, MuTEC
Samtools: https://github.com/samtools/bcftools/wiki/HOWTOs 25
26. VCF format (version 4.0)
Format used to report information about a position in the genome
Use by 1000 genomes to report all variants
26
28. GATK pipeline (BROAD) for variant calling
http://www.broadinstitute.org/gatk/guide/topic?name=methods-and-workflows
28
29. GATK Haplotype Caller shows improvements on detecting
misalignments around gaps and reducing false positives
29
30. GATK uses a Base Quality Recalibration method
http://www.nature.com/ng/journal/v43/n5/full/ng.806.html
Covariates
• the raw quality score
• the position of the base in the read
• the dinucleotide context and the read group
30
31. GATK uses a bayesian model
http://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-5-Variant_calling.pdf
• It determines possible SNP and indel alleles
• Computes, for each sample, for each genotype, likelihoods of data given
genotypes
• Computes the allele frequency distribution to determine most likely allele
count, and emit a variant call if determined
– When it reports a variant, it assign a genotype to each sample
31
32. From the output of thousands of variants,
which ones should you consider?
Various important considerations
• Is the variant call of good quality?
• If you expect a rare mutation, is the variant
commonly found in the general population?
• What is the predicted effect of the variant?
– Non-synonymous
– Detrimental for function
32
33. Read
alignment
• Mapping reads to reference
genome, processing to improve
alignment and mark pcr duplicates
Variant
identification
• Variant calling and
genotyping
Annotation
and Filtering
• Removing likely false
positive variants
Visualization
and
downstream
analyses
• (e.g.
association
studies)
A typical variant calling pipeline
33
34. Filtering variants
The initial set of variants is
usually filtered extensively in
hopes of removing false
positives.
More filtering can include public
data or custom filters based on
in-house data
2 novel in chr10
In house
exome
dbSNPs
1000
genomes
17,687 SNPs
PLoS One. 2012;7(1):e29708 34
35. Removing the common and low frequency to get
the novel SNPs
http://www.cardiovasculairegeneeskunde.nl/bestanden/Kathiresan.pdf
35
36. Filtering variants
The initial set of variants is usually filtered extensively
in hopes of removing false positives.
More filtering can include public data or custom filters
based on in-house data
Most exome studies will then filter common variants
(>1%) such as those in dbSNP database
– Good tools for filtering and annotation: Annovar, SNPeff
– Reports genes at or near variants as well as the type (non-
synonymous coding, UTR, splicing, etc.)
36
37. Filtering variants
The initial set of variants is usually filtered extensively in
hopes of removing false positives.
More filtering can include public data or custom filters
based on in-house data
Most exome studies will then filter common variants (>1%)
such as those in dbSNP database
– Good tools for filtering and annotation: Annovar, SNPeff
– Reports genes at or near variants as well as the type (non-
synonymous coding, UTR, splicing, etc.)
To filter by variant consequence, use SIFT and Polyphen2.
These are available via Annovar or ENSEMBL VEP
http://www.ensembl.org/Homo_sapiens/Tools/VEP
For a more flexible annotation workflow, explore GEMINI
http://gemini.readthedocs.org/en/latest/
37
38. Evaluating Variant Consequence
SIFT predictions - SIFT predicts whether an amino acid substitution affects
protein function based on sequence homology and the physical properties of
amino acids.
PolyPhen predictions - PolyPhen is a tool which predicts possible impact of
an amino acid substitution on the structure and function of a human protein
using straightforward physical and comparative considerations.
Condel consensus predictions - Condel computes a weighed average of the
scores (WAS) of several computational tools aimed at classifying missense
mutations as likely deleterious or likely neutral. The VEP currently presents a
Condel WAS from SIFT and PolyPhen.
38
39. Read
alignment
• Mapping reads to reference
genome, processing to improve
alignment and mark pcr duplicates
Variant
identification
• Variant calling and
genotyping
Annotation
and Filtering
• Removing likely false
positive variants
Visualization
and
downstream
analyses
• (e.g.
association
studies)
A typical variant calling pipeline
39
40. Viewing reads in browser
If your genome is available via the UCSC genome
browser http://genome.ucsc.edu/, import bam format
file to the UCSC genome browser by hosting the file on
a server and providing the link.
If your genome is not in UCSC, use another browser
such as IGV http://www.broadinstitute.org/igv/ , or IGB
http://bioviz.org/igb/
• Import genome (fasta)
• Import annotations (gff3 or bed format)
• Import data (bam)
40
41. Variant Analysis – Class Topics
Introduction
Variant
calling,
filtering and
annotation
Overview of
downstream
analyses
41
42. Before exome-seq there was GWAS
GWAS – Genome wide association study
Examination of common genetic variants in individuals
and their association with a trait
Analysis is usually association to disease (Case/control)
Genome-wide association
study of systemic sclerosis
identifies CD247 as a new
susceptibility locus
Nature Genetics 42, 426–429
(2010)
Manhatan Plot – each
dot is a SNP, Y-axis
shows association level
42
43. GWAS Association studies
A typical analysis:
• Identify SNPs
where one allele
is significantly
more common in
cases than
controls
• Hardy-Weinberg
Chi Square
http://en.wikipedia.org/wiki/Genome-wide_association_study 43
44. Analysis Tools for genotype-phenotype association
from sequencing data
Plink-seq - http://atgu.mgh.harvard.edu/plinkseq/
EPACTS - http://genome.sph.umich.edu/wiki/EPACTS
SNPTestv2 / GRANVIL - http://www.well.ox.ac.uk/GRANVIL/
44
45. Consider a more integrative
approach, which could
include adding more SNP or
exome data, biological
function, pedigree, gene
exprssion network and more.
Science 2002 Oltvai et al., pp. 763 - 764
45
46. A more integrative approach
http://www.genome.gov/Multimedia/
Slides/SequenceVariants2012/
Day2_IntegratedApproachWG.pdf
PLoS Genet. 2011 Jan
13;7(1):e1001273
46
47. Pathways Analysis
With pathways analysis
we can study:
• How genes and
molecules communicate
• The effect of a
disruption to a pathway
in relation to the
associated phenotype or
disease
• Which pathways are
likely to be affected an
affected individual
0 1 2 3 4 5
Reference
Test sample1
Test sample2
DNA damage Apoptosis
Interferon signalling
Enrichment and Network
47
48. Networks Analysis
Encore: Genetic Association Interaction Network Centrality Pipeline
and Application to SLE Exome Data
Nicholas A. Davis1,†,‡, Caleb A. Lareau2,3,‡, Bill C. White1, Ahwan Pandey1, Graham Wiley3,
Courtney G. Montgomery3, Patrick M. Gaffney3, B. A. McKinney1,2,*
Article first published online: 5 JUN 2013
• open source network analysis pipeline for genome-wide association
studies and rare variant data
48
49. Visualization
Browsers
• IGV - http://www.broadinstitute.org/igv/ - see slide 31
• UCSC Genome browser - http://genome.ucsc.edu/
Other
• Cytoscape – can be used for example to visualize
variants in the context of affected genes
49
50. Other applications of variant calling in
Deep Sequence: Virus evolution
Deep Sequencing detects early variants in HIV evolution
http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/v-phaser
http://www.plospathogens.org/article/info%3Adoi%2F10.1371%2Fjournal.ppat.1002529 50
51. Quasispecies evolution
Whole Genome Deep
Sequencing of HIV-1 Reveals
the Impact of Early Minor
Variants Upon Immune
Recognition During Acute
Infection
They concluded that the early
control of HIV-1 replication by
immunodominant CD8+ T cell
responses may be substantially
influenced by rapid, low
frequency viral adaptations not
detected by conventional
sequencing approaches, which
warrants further investigation.
http://www.plospathogens.org/article/info%3Adoi%2F10.1371%2Fjournal.ppat.1002529
51
52. Other applications of variant calling in
Deep Sequence: Fast genotyping
Outbreak investigation using high-throughput genome sequencing
within a diagnostic microbiology laboratory. J Clin Microbiol. 2013 Feb 13
Ion
Torrent
was used!
52