Whole exome sequencing (WES|WXS) and its data analysis
Feb 28, 2023
Haibo Liu
Senior Bioinformatician
UMass Medical School, Worcester, MA
Email: haibol2017@gmail.com
Eukaryotic Exome
The human exome contains about 180,000 exons. These constitute about 1% of
the human genome (~40 Mb).
Exome sequencing
• An NGS method that selectively sequences the protein-coding regions (exons) of the genome.
• Provides a cost-effective alternative to WGS
• Produces a smaller, more manageable data set for faster, easier data
analysis (4–5 Gb WES vs ~90 Gb WGS)
• Identifies both somatic and germline variants
• Single Nucleotide Polymorphisms (SNPs)
• Small Insertions-Deletions (indels)
• Loss of Heterozygosity (LOH)
• Copy Number Variants (CNVs), structural variants (SV)
• Microsatellite instability (MSI) status
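The data-size comparison above (4–5 Gb WES vs ~90 Gb WGS) follows from target size times mean depth. A minimal sketch of that arithmetic, assuming a 40 Mb exome at 100x and a round 3.0 Gb genome at 30x; the depths, the genome size, and the function name `required_bases` are illustrative assumptions, not values from the slides:

```python
def required_bases(target_size_bp: float, mean_depth: float,
                   on_target_rate: float = 1.0) -> float:
    """Bases to sequence so the target reaches `mean_depth` on average.

    `on_target_rate` (assumed parameter) is the fraction of sequenced
    bases that land on the target; 1.0 means no off-target loss.
    """
    if not 0 < on_target_rate <= 1:
        raise ValueError("on_target_rate must be in (0, 1]")
    return target_size_bp * mean_depth / on_target_rate


# Assumed, illustrative inputs (not from the slides):
wes_bases = required_bases(40e6, 100)   # 40 Mb exome at 100x
wgs_bases = required_bases(3.0e9, 30)   # ~3 Gb genome at 30x

print(f"WES: {wes_bases / 1e9:.0f} Gb, WGS: {wgs_bases / 1e9:.0f} Gb")
```

With a realistic on-target rate below 1.0, the WES figure rises accordingly, which is one reason the slide quotes a range.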
Performance of WES in clinical studies
Workflow of WES
Genotyping by Microarray, WES, and WGS
(not updated; data analysis cost not included)
Experimental design of WES
• Tissue sampling
• Somatic mutations
• Tumor (tumor purity and freshness are critical)
• Normal tissue or blood sample
• Germline mutations
• Blood or any other tissue
• Sample size and sample population
• Cohort (disease vs. healthy)
• Trio, related family (non-carrier, carrier, and patient)
• Capture methods
• Sequencing strategies
• Platform, PE|SE, UMI, read length, sequencing depth
Rescuing DNA preparation from FFPE samples
Exome capture: Target-enrichment strategies
Array-based capture
https://en.wikipedia.org/wiki/Exome_sequencing
Capture toolkits
• Twist Exome 2.0 (Twist Bioscience)
• Nextera Rapid Capture Exomes (Illumina)
• xGen WES (IDT)
• SureSelect (Agilent)
• KAPA HyperExome (Roche)
• SeqCap (NimbleGen)
• …
UMIs for detecting low-frequency mutations in prenatal or cancer research
The Cell3™ Target library preparation behind our whole exome enrichment incorporates error suppression
technology. This includes unique molecular indexes (UMIs) and unique dual indexes (UDIs), to remove both
PCR and sequencing errors and index hopping events. This error suppression technique, combined with our
excellent uniformity of coverage, allows you to confidently and accurately call mutations down to 0.1% VAF
and enables generation of sequencing libraries from as little as 1 ng cfDNA input.
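The error-suppression idea described above can be sketched as grouping reads that share a mapping position and UMI into a family, then taking a per-base majority vote. This is a simplified illustration, not the vendor's method; the tuple layout and the function name `umi_consensus` are assumptions, and real tools also handle UMI sequencing errors and base qualities:

```python
from collections import Counter, defaultdict


def umi_consensus(reads):
    """Collapse reads into one consensus sequence per UMI family.

    `reads` is an iterable of (chrom, pos, umi, sequence) tuples
    (a hypothetical, simplified representation of aligned reads).
    Reads sharing chrom, pos, and UMI are assumed to derive from the
    same original molecule.
    """
    families = defaultdict(list)
    for chrom, pos, umi, seq in reads:
        families[(chrom, pos, umi)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        length = min(len(s) for s in seqs)  # compare the shared prefix only
        consensus[key] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


reads = [
    ("chr1", 100, "AACG", "ACGT"),
    ("chr1", 100, "AACG", "ACGT"),
    ("chr1", 100, "AACG", "ACTT"),  # one read carries a PCR/sequencing error
    ("chr1", 100, "TTGC", "ACTT"),  # different UMI: a different molecule
]
print(umi_consensus(reads))
```

An error present in only one read of a family is voted out, while a variant carried by a separate molecule (different UMI) is retained.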
Comparison of different library preparation methods
Sequencing depth
Quality control in WES
Raw data QC, BAM QC, variant QC
Raw data QC
• QC tools
• FastQC/MultiQC
• NGS QC toolkit (https://github.com/mjain-lab/NGSQCToolkit)
• QC-chain (contamination detection)
• PRINSEQ
• QC3
• Important QC metrics
• Base quality
• Nucleotide distribution along cycles
• GC content distribution
• Duplication rate
• Adaptor content
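Two of the metrics listed above, base quality per cycle and GC content, can be computed directly from FASTQ records. A minimal sketch, assuming Phred+33 encoding and records already parsed into (sequence, quality) pairs; the function name `fastq_metrics` is hypothetical and this is not how FastQC or the other listed tools are implemented:

```python
def fastq_metrics(records, phred_offset=33):
    """Return (mean base quality per cycle, overall GC fraction).

    `records` is a list of (sequence, quality_string) pairs; the
    quality strings are assumed to be Phred+33 encoded.
    """
    qual_sum, counts = [], []
    gc = total = 0
    for seq, qual in records:
        for cycle, char in enumerate(qual):
            if cycle == len(qual_sum):      # first read reaching this cycle
                qual_sum.append(0)
                counts.append(0)
            qual_sum[cycle] += ord(char) - phred_offset
            counts[cycle] += 1
        gc += sum(base in "GCgc" for base in seq)
        total += len(seq)
    mean_quality = [s / n for s, n in zip(qual_sum, counts)]
    gc_fraction = gc / total if total else 0.0
    return mean_quality, gc_fraction


mean_q, gc = fastq_metrics([("ACGT", "IIII"), ("GGCC", "5555")])
print(mean_q, gc)
```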
QC3
Read trimming
• Trimmomatic, cutadapt, fastp (auto adaptor detection), …
• Quality/adaptor trimming
• Don’t trim the 5’ end: duplicate marking (MarkDuplicates) relies on the 5’ alignment positions
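Quality trimming that leaves the 5’ end untouched can be sketched as removing low-quality bases from the 3’ end only. This is a deliberately simple trailing-base trim under assumed defaults (Phred+33, minimum quality 20), not the algorithm of Trimmomatic, cutadapt, or fastp; the function name `trim_3prime` is hypothetical:

```python
def trim_3prime(seq, qual, min_q=20, phred_offset=33):
    """Remove trailing bases with quality below `min_q`.

    Only the 3' end is trimmed, so the 5' start of the read, which
    duplicate marking depends on, is preserved.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


print(trim_3prime("ACGTAC", "IIII##"))
```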
From raw fastq to analysis-ready BAM
Aligner
• BWA-MEM
• Bowtie2, Novoalign, GMAP
Selection of reference genomes
• Completeness
• Decoyed genome (1000 Genomes analysis pipeline)
• EBV (herpesvirus 4 type 1, AC:NC_007605) and decoy sequences derived from HuRef, human BAC and fosmid clones, and NA12878 (~36 Mb)
• T2T-CHM13v1.1, the latest, complete human reference genome
Quality control in WES
Raw data QC, BAM QC, variant QC
BAM QC
• Important QC metrics
• % of reads that map to the reference
• % of reads that map to the baits
• Coverage depth distribution (target regions)
• Coverage unevenness & Cohort Coverage Sparseness
• Insert size distribution
• Duplicate rate
• Tools
• Alfred
• QC3
• Various Picard Collect*Metrics tools (e.g., CollectHsMetrics)
• covReport
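Two of the BAM-level metrics above, the fraction of reads overlapping the baits and the depth distribution over targets, reduce to simple interval arithmetic. A minimal sketch assuming half-open coordinates and in-memory lists; the function names and the 20x threshold are illustrative assumptions, and tools such as Picard or Alfred compute these from the BAM directly:

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def on_target_fraction(reads, targets):
    """Fraction of reads overlapping any target interval.

    `reads` is a list of (chrom, start, end); `targets` maps
    chrom -> list of (start, end). Both layouts are assumptions.
    """
    if not reads:
        return 0.0
    hits = sum(
        any(overlaps(s, e, ts, te) for ts, te in targets.get(chrom, []))
        for chrom, s, e in reads
    )
    return hits / len(reads)


def coverage_summary(depths, min_depth=20):
    """Mean depth and fraction of target bases covered at >= min_depth."""
    n = len(depths)
    mean = sum(depths) / n
    covered = sum(d >= min_depth for d in depths) / n
    return mean, covered


targets = {"chr1": [(100, 200)]}
reads = [("chr1", 90, 110), ("chr1", 200, 250), ("chr2", 100, 150), ("chr1", 150, 160)]
print(on_target_fraction(reads, targets))
print(coverage_summary([10, 20, 30, 40]))
```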
Cohort Coverage Sparseness (CCS) and Unevenness (UE) scores for a detailed assessment of the distribution of coverage of sequence reads
https://www.nature.com/articles/s41598-017-01005-x
Local and global non-uniformity of different capture toolkits
Differences from Capture toolkits
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4092227/
Exome probe design is one of the major culprits
• Most of the observed bias in modern WES stems from mappability limitations of short reads and exome probe design rather than sequence composition.
https://www.nature.com/articles/s41598-020-59026-y
Alfred QC metrics
https://academic.oup.com/bioinformatics/article/35/14/2489/5232224
Alignment Metric | DNA-Seq (WGS) | DNA-Seq (Capture) | RNA-Seq | ChIP-Seq/ATAC-Seq | Chart Type
Mapping Statistics | ✔ | ✔ | ✔ | ✔ | Table
Duplicate Statistics | ✔ | ✔ | ✔ | ✔ | Table
Sequencing Error Rates | ✔ | ✔ | ✔ | ✔ | Table
Base Content Distribution | ✔ | ✔ | ✔ | ✔ | Grouped Line Chart
Read Length Distribution | ✔ | ✔ | ✔ | ✔ | Line Chart
Base Quality Distribution | ✔ | ✔ | ✔ | ✔ | Line Chart
Coverage Histogram | ✔ | ✔ | ✔ | ✔ | Line Chart
Insert Size Distribution | ✔ | ✔ | ✔ | ✔ | Grouped Line Chart
InDel Size Distribution | ✔ | ✔ | ✔ | ✔ | Grouped Line Chart
InDel Context | ✔ | ✔ | ✔ | ✔ | Bar Chart
GC Content | ✔ | ✔ | ✔ | ✔ | Grouped Line Chart
On-Target Rate | | ✔ | | | Line Chart
Target Coverage Distribution | | ✔ | | | Line Chart
TSS Enrichment | | | | ✔ | Table
DNA pitch / Nucleosome pattern | | | | ✔ | Grouped Line Chart
https://www.gear-genomics.com/docs/alfred/webapp/#features
CovReport
From BAM to VCF
GATK:
• Slop (pad) exons by 200 bp
• Analysis for each chromosome
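Padding ("slopping") exons by 200 bp and organizing the result per chromosome can be sketched as follows. This is an illustrative helper, not a GATK or bedtools call; the function name `pad_and_merge` is hypothetical, coordinates are assumed 0-based half-open, and the sketch does not clip at chromosome ends because no chromosome lengths are supplied:

```python
from collections import defaultdict


def pad_and_merge(exons, pad=200):
    """Pad each exon by `pad` bp on both sides and merge overlaps.

    `exons` is a list of (chrom, start, end). Returns a dict mapping
    chrom -> sorted list of merged (start, end) intervals, so each
    chromosome can be processed independently.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((max(0, start - pad), end + pad))

    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:            # overlapping or touching
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


exons = [("chr1", 1000, 1100), ("chr1", 1300, 1400),
         ("chr1", 5000, 5100), ("chr2", 100, 300)]
for chrom, intervals in pad_and_merge(exons).items():
    print(chrom, intervals)
```

Merging after padding avoids calling the same bases twice where neighboring padded exons overlap.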
Variant callers
• GATK: Mutect2 (somatic), HaplotypeCaller (germline)
• FreeBayes/SAMtools, DeepVariant
• BreakSeq, LUMPY, Hydra, DELLY, CNVnator, Pindel
GATK Best Practices for population-based germline variant calling
GATK Mutect2 Best Practices for population-based somatic variant calling
Discrepancy of variants called by different callers
Integrated variant calling
• Integration of multiple tools’ results
• Isma (integrative somatic mutation analysis)
• Ensemble machine-learning methods
• BAYSIC
• SomaticSeq
• NeoMutate
• SMuRF
(Bartha and Gyorffy 2019)
(Nanni et al. 2019)
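The simplest form of integrating multiple callers is a vote: keep variants reported by at least N callers. The sketch below illustrates only that baseline; the tools listed above use statistical or machine-learning models rather than a plain vote, and the function name `consensus_calls` and the variant tuple layout are assumptions:

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=2):
    """Return variants reported by at least `min_callers` callers.

    `calls_by_caller` maps caller name -> iterable of variants, each
    variant a hashable key such as (chrom, pos, ref, alt).
    """
    votes = Counter()
    for variants in calls_by_caller.values():
        votes.update(set(variants))   # one vote per caller per variant
    return {variant for variant, n in votes.items() if n >= min_callers}


calls = {
    "caller_a": [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")],
    "caller_b": [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")],
    "caller_c": [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")],
}
print(sorted(consensus_calls(calls)))
```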
Quality control in WES
Raw data QC, BAM QC, variant QC
Sample-level Variant QC
• Tools
• GATK CollectVariantCallingMetrics, VCFtools, PLINK/SEQ, QC3
• Important QC metrics
• Ti/Tv ratio, nonsynonymous/synonymous ratio, heterozygous/nonreference-homozygous (het/nonref-hom) ratio, mean depth
• Genotype missing rate
• Genotype concordance to related data (different platforms)
• Cross-sample DNA contamination (VerifyBamID)
• Identity-by-descent (IBD) analysis (PLINK)
• Related samples
• PCA (EIGENSTRAT)
• Population stratum (ethnicity)
• Sex check (PLINK)
Ti/Tv ratio and het/nonref-hom ratio
• The Ti/Tv ratio varies greatly by genome region and functionality, but not by ancestry.
• The het/nonref-hom ratio varies greatly by ancestry, but not by genome region and functionality.
• Extreme guanine + cytosine content (either high or low) is negatively associated with the Ti/Tv ratio magnitude.
• When performing QC assessment using these two measures, care must be taken to apply the correct thresholds based on ancestry and genome region.
https://academic.oup.com/bioinformatics/article/31/3/318/2366248
Ti/Tv ratio too low ==> high false-positive rate; too high ==> bias.
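Both ratios discussed above are simple counts. A minimal sketch, assuming SNVs are given as (ref, alt) single-base pairs and genotypes as VCF-style strings for biallelic sites; the function names are hypothetical:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Transitions are purine<->purine or pyrimidine<->pyrimidine changes."""
    return ref != alt and ({ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES)


def titv_ratio(snvs):
    """Ti/Tv ratio for a list of (ref, alt) single-nucleotide variants."""
    ti = sum(is_transition(r, a) for r, a in snvs)
    tv = sum(r != a and not is_transition(r, a) for r, a in snvs)
    return ti / tv if tv else float("inf")


def het_nonref_hom_ratio(genotypes):
    """het/nonref-hom ratio from biallelic genotype strings."""
    het = sum(g in ("0/1", "1/0", "0|1", "1|0") for g in genotypes)
    hom = sum(g in ("1/1", "1|1") for g in genotypes)
    return het / hom if hom else float("inf")


snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(titv_ratio(snvs))
print(het_nonref_hom_ratio(["0/1", "0/1", "1/1", "0/0", "0|1"]))
```

As the bullets above note, the value to compare against depends on genome region (for Ti/Tv) and ancestry (for het/nonref-hom), so no threshold is hard-coded here.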
Example report
Potential error sources in next-generation sequencing workflow
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1659-6
Origin of variant artifacts
• Artifacts introduced by sample/library preparation
• Low-quality base calls (read-end artifacts and other low-quality bases)
• Alignment artifacts
• Local misalignment near indels
• Erroneous alignments in low-complexity regions
• Paralogous alignments of reads not well represented in the reference
• Strand orientation bias artifacts (Strand Orientation Bias Detector (SOBDetector), Fisher score): https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-666
Artifacts (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00791-w): low base quality, read end, strand bias, low-complexity misalignment, paralog misalignment
Variant-level QC
• Important QC metrics
• Genotype missing rate
• Hardy-Weinberg equilibrium (HWE) p-value (use with caution)
• Mendelian error rate
• Allele balance of heterozygous calls
• Variant quality score (GATK): filter SNPs and indels separately (https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants)
• Hard filter
• QualByDepth (QD)
• FisherStrand (FS)
• StrandOddsRatio (SOR)
• RMSMappingQuality (MQ)
• MappingQualityRankSumTest (MQRankSum)
• ReadPosRankSumTest (ReadPosRankSum)
• Example SNP hard filter (GATK3-style arguments; GATK4 spells them --filter-expression and --filter-name):
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
--filterName "my_snp_filter"
• Machine learning-based filtering: Variant Quality Score Recalibration (VQSR)
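The logic of the SNP hard-filter expression above can be mirrored in plain Python to make the thresholds explicit. This is a sketch for illustration, not a replacement for GATK VariantFiltration; the names `SNP_HARD_FILTER` and `fails_snp_filter` are hypothetical, and skipping missing annotations is an assumption of this sketch rather than documented GATK behavior:

```python
# Thresholds copied from the filter expression on the slide.
SNP_HARD_FILTER = {
    "QD": ("<", 2.0),
    "FS": (">", 60.0),
    "MQ": ("<", 40.0),
    "MQRankSum": ("<", -12.5),
    "ReadPosRankSum": ("<", -8.0),
}


def fails_snp_filter(info):
    """True if any annotation in `info` violates its threshold.

    `info` maps annotation name -> numeric value (as parsed from a VCF
    INFO field). Missing annotations are skipped (sketch assumption).
    """
    for key, (op, threshold) in SNP_HARD_FILTER.items():
        value = info.get(key)
        if value is None:
            continue
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False


good = {"QD": 15.2, "FS": 1.3, "MQ": 60.0, "MQRankSum": 0.1, "ReadPosRankSum": 0.4}
bad = {"QD": 1.1, "FS": 1.3, "MQ": 60.0}
print(fails_snp_filter(good), fails_snp_filter(bad))
```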
Variant-level filtering
• Tools
• GATK, VCFtools, PLINK/Seq
• Sequencing data-based filtering
• Exclude potential artifacts
• Database-based filtering:
• Exclude known variants that are present in public SNP databases, published studies, or in-house databases, on the assumption that common variants represent harmless variation
• Pedigree-based filtering
• Each generation introduces up to 4.5 deleterious mutations, so a de novo mutation may well be causing the disease.
• Function-based filtering
• Caution: risk removing the pathogenic variant
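Pedigree-based filtering in a trio can be sketched as two checks: whether the child's genotype is consistent with Mendelian inheritance, and whether it looks like a de novo candidate. The sketch assumes autosomal, biallelic sites with genotypes encoded as alternate-allele counts (0, 1, 2); the function names are hypothetical, and real pipelines also weigh depth and genotype quality:

```python
def possible_gametes(genotype):
    """Alleles (0 = ref, 1 = alt) a parent with this genotype can transmit."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(child, father, mother):
    """True if the child's alt-allele count can arise from the parents."""
    return any(
        f + m == child
        for f in possible_gametes(father)
        for m in possible_gametes(mother)
    )


def is_de_novo_candidate(child, father, mother):
    """Heterozygous child with both parents homozygous reference."""
    return child == 1 and father == 0 and mother == 0


print(is_mendelian_consistent(1, 0, 1))
print(is_de_novo_candidate(1, 0, 0))
```

A de novo candidate is by construction a Mendelian inconsistency, so a high Mendelian error rate across a sample points to genotyping problems rather than many true de novo events.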
Allelic balance
https://www.cureffi.org/2012/09/19/exome-sequencing-pipeline-using-gatk/
Allelic balance
• SLIVAR: genotype quality, sequencing depth, allele balance, and
population allele frequency : https://github.com/brentp/slivar
https://onlinelibrary.wiley.com/doi/full/10.1002/humu.23674
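Allele balance for a heterozygous call is the fraction of reads supporting the alternate allele. A minimal sketch of an allele-balance and depth check; the 0.2–0.8 window, the minimum depth of 10, and the function names are illustrative assumptions, not thresholds taken from the slides or from slivar's defaults:

```python
def allele_balance(ref_depth, alt_depth):
    """Fraction of reads supporting the alt allele, or None with no reads."""
    total = ref_depth + alt_depth
    return alt_depth / total if total else None


def het_call_ok(ref_depth, alt_depth, low=0.2, high=0.8, min_depth=10):
    """Keep a heterozygous call only if depth and allele balance look plausible.

    Thresholds are assumed defaults for illustration.
    """
    balance = allele_balance(ref_depth, alt_depth)
    if balance is None or ref_depth + alt_depth < min_depth:
        return False
    return low <= balance <= high


print(allele_balance(30, 10))
print(het_call_ok(15, 15), het_call_ok(28, 2), het_call_ok(3, 3))
```

A true germline heterozygote is expected near 0.5; strongly skewed balance suggests an artifact, contamination, or, in tumor samples, a subclonal variant.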
Variant annotation tools
VAT: annotation of variants by functionality in a cloud-computing environment.
Variant annotation databases
Functional predictors/Prioritization tools
snpSift http://pcingola.github.io/SnpEff/ss_introduction/
(Hintzsche et al., 2016)
VAAST https://github.com/Yandell-Lab/VVP-pub
VarSifter, VarSight
gNome, KGGseq
AlphaMissense: a recent, advanced AI tool for inferring the effects of missense mutations
(Cheng et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 19 Sep 2023, Vol. 381, Issue 6664, DOI: 10.1126/science.adg7492)
(Hintzsche et al., 2016)
Tools and resources for linking variants to therapeutics
Variant visualization tools
VIVA, vcfR
oncoprint
Oncoprint for visualizing cohort variants
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6895801/
Beyond variants
Summary
WES and its data analysis
WES data analysis pipelines
• DRAGEN (Illumina)
• https://www.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/dragen-enrichment.html
• JWES
• Sentieon: a high-performance commercial solution (https://www.sentieon.com/products/)
• Improves upon BWA, STAR, Minimap2, GATK, HaplotypeCaller, Mutect, and Mutect2 based pipelines and is deployable on any generic-CPU-based computing system
WES data analysis pipelines

More Related Content

What's hot

RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
mikaelhuss
 
Arms 2
Arms 2Arms 2
Lecture 7 gwas full
Lecture 7 gwas fullLecture 7 gwas full
Lecture 7 gwas full
Lekki Frazier-Wood
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5
BITS
 
Next Generation Sequencing of DNA
Next Generation Sequencing of DNANext Generation Sequencing of DNA
Next Generation Sequencing of DNA
maryamshah13
 
Variant analysis and whole exome sequencing
Variant analysis and whole exome sequencingVariant analysis and whole exome sequencing
Variant analysis and whole exome sequencing
Bioinformatics and Computational Biosciences Branch
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation Sequencing
Farid MUSA
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
AGRF_Ltd
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
COST action BM1006
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation Sequencing
Sajad Rafatiyan
 
Ngs introduction
Ngs introductionNgs introduction
Ngs introduction
Alagar Suresh
 
RNA-Seq
RNA-SeqRNA-Seq
Digital PCR.pptx
Digital PCR.pptxDigital PCR.pptx
Digital PCR.pptx
AlanShwan2
 
Introduction to real-Time Quantitative PCR (qPCR) - Download the slides
Introduction to real-Time Quantitative PCR (qPCR) - Download the slidesIntroduction to real-Time Quantitative PCR (qPCR) - Download the slides
Introduction to real-Time Quantitative PCR (qPCR) - Download the slides
QIAGEN
 
Principle and workflow of whole genome bisulfite sequencing
Principle and workflow of whole genome bisulfite sequencingPrinciple and workflow of whole genome bisulfite sequencing
Principle and workflow of whole genome bisulfite sequencing
sciencelearning123
 
qRT-PCR.pdf
qRT-PCR.pdfqRT-PCR.pdf
qRT-PCR.pdf
ShadenAlharbi
 
Sanger sequencing
Sanger sequencingSanger sequencing
Sanger sequencing
SUJITSINGH134
 
Clinical Applications of Next Generation Sequencing
Clinical Applications of Next Generation SequencingClinical Applications of Next Generation Sequencing
Clinical Applications of Next Generation Sequencing
Bell Symposium &amp; MSP Seminar
 
Genome assembly
Genome assemblyGenome assembly
RNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeRNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the Transcriptome
Sean Davis
 

What's hot (20)

RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
Arms 2
Arms 2Arms 2
Arms 2
 
Lecture 7 gwas full
Lecture 7 gwas fullLecture 7 gwas full
Lecture 7 gwas full
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5
 
Next Generation Sequencing of DNA
Next Generation Sequencing of DNANext Generation Sequencing of DNA
Next Generation Sequencing of DNA
 
Variant analysis and whole exome sequencing
Variant analysis and whole exome sequencingVariant analysis and whole exome sequencing
Variant analysis and whole exome sequencing
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation Sequencing
 
An introduction to RNA-seq data analysis
An introduction to RNA-seq data analysisAn introduction to RNA-seq data analysis
An introduction to RNA-seq data analysis
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation Sequencing
 
Ngs introduction
Ngs introductionNgs introduction
Ngs introduction
 
RNA-Seq
RNA-SeqRNA-Seq
RNA-Seq
 
Digital PCR.pptx
Digital PCR.pptxDigital PCR.pptx
Digital PCR.pptx
 
Introduction to real-Time Quantitative PCR (qPCR) - Download the slides
Introduction to real-Time Quantitative PCR (qPCR) - Download the slidesIntroduction to real-Time Quantitative PCR (qPCR) - Download the slides
Introduction to real-Time Quantitative PCR (qPCR) - Download the slides
 
Principle and workflow of whole genome bisulfite sequencing
Principle and workflow of whole genome bisulfite sequencingPrinciple and workflow of whole genome bisulfite sequencing
Principle and workflow of whole genome bisulfite sequencing
 
qRT-PCR.pdf
qRT-PCR.pdfqRT-PCR.pdf
qRT-PCR.pdf
 
Sanger sequencing
Sanger sequencingSanger sequencing
Sanger sequencing
 
Clinical Applications of Next Generation Sequencing
Clinical Applications of Next Generation SequencingClinical Applications of Next Generation Sequencing
Clinical Applications of Next Generation Sequencing
 
Genome assembly
Genome assemblyGenome assembly
Genome assembly
 
RNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeRNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the Transcriptome
 

Similar to Whole exome sequencing data analysis.pptx

Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Prof. Wim Van Criekinge
 
160628 giab for festival of genomics
160628 giab for festival of genomics160628 giab for festival of genomics
160628 giab for festival of genomics
GenomeInABottle
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GenomeInABottle
 
Large Scale PCA Analysis in SVS
Large Scale PCA Analysis in SVSLarge Scale PCA Analysis in SVS
Large Scale PCA Analysis in SVS
Golden Helix
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger
Eli Kaminuma
 
160627 giab for festival sv workshop
160627 giab for festival sv workshop160627 giab for festival sv workshop
160627 giab for festival sv workshop
GenomeInABottle
 
20160219 - S. De Toffol - Dal Sanger al NGS nello studio delle mutazioni BRCA
20160219 - S. De Toffol -  Dal Sanger al NGS nello studio delle mutazioni BRCA �20160219 - S. De Toffol -  Dal Sanger al NGS nello studio delle mutazioni BRCA �
20160219 - S. De Toffol - Dal Sanger al NGS nello studio delle mutazioni BRCA
Roberto Scarafia
 
Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...
OECD Environment
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
GenomeInABottle
 
Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015
Kim D. Pruitt
 
GIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM ForumGIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM Forum
GenomeInABottle
 
Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917
GenomeInABottle
 
Mar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working GroupMar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working Group
GenomeInABottle
 
2012 10-24 - ngs webinar
2012 10-24 - ngs webinar2012 10-24 - ngs webinar
2012 10-24 - ngs webinar
Elsa von Licy
 
Tools for Using NIST Reference Materials
Tools for Using NIST Reference MaterialsTools for Using NIST Reference Materials
Tools for Using NIST Reference Materials
GenomeInABottle
 
16S MVRSION at Washington University
16S MVRSION at Washington University16S MVRSION at Washington University
16S MVRSION at Washington University
Seth Crosby
 
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
Nathan Olson
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016
GenomeInABottle
 
ChipSeq Data Analysis
ChipSeq Data AnalysisChipSeq Data Analysis
ChipSeq Data Analysis
COST action BM1006
 
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A RathoreGRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
CGIAR Generation Challenge Programme
 

Similar to Whole exome sequencing data analysis.pptx (20)

Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
 
160628 giab for festival of genomics
160628 giab for festival of genomics160628 giab for festival of genomics
160628 giab for festival of genomics
 
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
 
Large Scale PCA Analysis in SVS
Large Scale PCA Analysis in SVSLarge Scale PCA Analysis in SVS
Large Scale PCA Analysis in SVS
 
[2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger [2017-05-29] DNASmartTagger
[2017-05-29] DNASmartTagger
 
160627 giab for festival sv workshop
160627 giab for festival sv workshop160627 giab for festival sv workshop
160627 giab for festival sv workshop
 
20160219 - S. De Toffol - Dal Sanger al NGS nello studio delle mutazioni BRCA
20160219 - S. De Toffol -  Dal Sanger al NGS nello studio delle mutazioni BRCA �20160219 - S. De Toffol -  Dal Sanger al NGS nello studio delle mutazioni BRCA �
20160219 - S. De Toffol - Dal Sanger al NGS nello studio delle mutazioni BRCA
 
Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...Overview of the commonly used sequencing platforms, bioinformatic search tool...
Overview of the commonly used sequencing platforms, bioinformatic search tool...
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
 
Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015Kim Pruitt trainingbiocuration2015
Kim Pruitt trainingbiocuration2015
 
GIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM ForumGIAB for AMP GeT-RM Forum
GIAB for AMP GeT-RM Forum
 
Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917
 
Mar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working GroupMar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working Group
 
2012 10-24 - ngs webinar
2012 10-24 - ngs webinar2012 10-24 - ngs webinar
2012 10-24 - ngs webinar
 
Tools for Using NIST Reference Materials
Tools for Using NIST Reference MaterialsTools for Using NIST Reference Materials
Tools for Using NIST Reference Materials
 
16S MVRSION at Washington University
16S MVRSION at Washington University16S MVRSION at Washington University
16S MVRSION at Washington University
 
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
 
Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016Genome in a bottle for ashg grc giab workshop 181016
Genome in a bottle for ashg grc giab workshop 181016
 
ChipSeq Data Analysis
ChipSeq Data AnalysisChipSeq Data Analysis
ChipSeq Data Analysis
 
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A RathoreGRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
GRM 2013: Genome-Wide Selection Update -- RK Varshney and A Rathore
 

Recently uploaded

Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 

Recently uploaded (20)

Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 

Whole exome sequencing data analysis.pptx

  • 1. Whole exome sequencing (WES|WXS) and its data analysis Feb 28, 2023 Haibo Liu Senior Bioinformatician UMass Medical School, Worcester, MA Email: haibol2017@gmail.com
  • 2. Eukaryotic Exome The human exome contains about 180,000 exons. These constitute about 1% of the human genome (~40 Mb).
  • 3. Exome sequencing • An NGS method that selectively sequences the exonic (mainly protein-coding) regions of the genome. • Provides a cost-effective alternative to WGS • Produces a smaller, more manageable data set for faster, easier data analysis (4–5 Gb WES vs ~90 Gb WGS) • Identifies both somatic and germline variants • Single Nucleotide Polymorphisms (SNPs) • Small Insertions-Deletions (indels) • Loss of Heterozygosity (LOH) • Copy Number Variants (CNVs), structural variants (SVs) • Microsatellite instability (MSI) status
  • 4. Performance of WES in clinical studies
  • 6. Genotyping by Microarray, WES, and WGS (not updated, data analysis cost not included)
  • 7. Experimental design of WES • Tissue sampling • Somatic mutations • Tumor (tumor purity and freshness are critical) • Normal tissue or blood sample • Germline mutations • Blood or any other tissue • Sample size and sample population • Cohort (disease vs. healthy) • Trio or related family (non-carrier, carrier, and patient) • Capture methods • Sequencing strategies • Platform, PE|SE, UMI, read length, sequencing depth
  • 8. Rescuing DNA preparations from FFPE (formalin-fixed, paraffin-embedded) samples
  • 9. Exome capture: Target-enrichment strategies • Array-based capture (https://en.wikipedia.org/wiki/Exome_sequencing) • Capture toolkits • Twist Exome 2.0 (Twist Bioscience) • Nextera Rapid Capture Exomes (Illumina) • xGen WES (IDT) • SureSelect (Agilent) • KAPA HyperExome (Roche) • SeqCap (NimbleGen) • …
  • 10. UMI for detecting low frequency mutations for prenatal or cancer research The Cell3™ Target library preparation behind our whole exome enrichment incorporates error suppression technology. This includes unique molecular indexes (UMIs) and unique dual indexes (UDIs), to remove both PCR and sequencing errors and index hopping events. This error suppression technique, combined with our excellent uniformity of coverage, allows you to confidently and accurately call mutations down to 0.1% VAF and enables generation of sequencing libraries from as little as 1 ng cfDNA input.
  • 11. Comparison of different library preparation methods
  • 12. Comparison of different library preparation methods
  • 14. Quality control in WES: raw data QC, BAM QC, variant QC
  • 15. Raw data QC • QC tools • FastQC/MultiQC • NGS QC Toolkit (https://github.com/mjain-lab/NGSQCToolkit) • QC-chain (contamination detection) • PRINSEQ • QC3 • Important QC metrics • Base quality • Nucleotide distribution along cycles • GC content distribution • Duplication rate • Adaptor content
  • 16. Read trimming • Trimmomatic, cutadapt, fastp (automatic adaptor detection), … • Quality/adaptor trimming • Don’t trim the 5’ end: duplicate marking (MarkDuplicates) relies on the 5’ alignment positions of reads
  • 17. From raw fastq to analysis-ready BAM
  • 19. Selection of reference genomes • Completeness • Decoyed genome (1000 Genomes analysis pipeline) • EBV (human herpesvirus 4 type 1, accession NC_007605) and decoy sequences derived from HuRef, human BAC and fosmid clones, and NA12878 (~36 Mb) • T2T-CHM13v1.1, the latest, complete human reference genome
  • 20. Quality control in WES: raw data QC, BAM QC, variant QC
  • 21. BAM QC • Important QC metrics • % of reads that map to the reference • % of reads that map to the baits • Coverage depth distribution (target regions) • Coverage unevenness & Cohort Coverage Sparseness • Insert size distribution • Duplicate rate • Tools • Alfred • QC3 • Various picard CollectMetrics tools • covReport
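Two of the BAM QC metrics listed above, the on-target (bait) rate and the coverage depth distribution over target regions, can be illustrated with a minimal Python sketch. The function names, the depth thresholds (10x, 20x, 30x), and the toy numbers are assumptions made for illustration; in practice the per-base depths and read counts would come from one of the tools named on the slide.

```python
from statistics import mean


def coverage_summary(depths, thresholds=(10, 20, 30)):
    """Summarize per-base depths over target regions.

    Returns the mean depth and the fraction of target bases covered
    at or above each threshold (thresholds are illustrative).
    """
    if not depths:
        raise ValueError("depths must not be empty")
    n = len(depths)
    summary = {"mean_depth": mean(depths)}
    for t in thresholds:
        summary[f"frac_ge_{t}x"] = sum(d >= t for d in depths) / n
    return summary


def on_target_rate(reads_on_target, mapped_reads):
    """Fraction of mapped reads that overlap the capture baits/targets."""
    return reads_on_target / mapped_reads if mapped_reads else 0.0


# Toy per-base depths for ten target positions (hypothetical values).
depths = [0, 5, 12, 25, 40, 38, 31, 22, 9, 18]
summary = coverage_summary(depths)
rate = on_target_rate(reads_on_target=750, mapped_reads=1000)
print(summary, rate)
```

A large gap between the mean depth and the fraction of bases at 20x or more is a quick hint of uneven capture, which the CCS and UE scores on the next slide quantify more rigorously.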
  • 22. Cohort Coverage Sparseness (CCS) and Unevenness (UE) Scores for a detailed assessment of the distribution of coverage of sequence reads https://www.nature.com/articles/s41598-017-01005-x
  • 23. Local and global non-uniformity of different capture toolkits
  • 25. Differences from Capture toolkits https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4092227/
  • 26. Exome probe design is one of the major culprits • Most of the observed bias in modern WES stems from mappability limitations of short reads and exome probe design rather than sequence composition. https://www.nature.com/articles/s41598-020-59026-y
  • 27. Alfred QC metrics (https://academic.oup.com/bioinformatics/article/35/14/2489/5232224) • Metrics reported for all four assay types (DNA-Seq WGS, DNA-Seq capture, RNA-Seq, ChIP-Seq/ATAC-Seq), with chart type: Mapping Statistics (table); Duplicate Statistics (table); Sequencing Error Rates (table); Base Content Distribution (grouped line chart); Read Length Distribution (line chart); Base Quality Distribution (line chart); Coverage Histogram (line chart); Insert Size Distribution (grouped line chart); InDel Size Distribution (grouped line chart); InDel Context (bar chart); GC Content (grouped line chart) • Metrics reported for a single assay type: On-Target Rate (line chart); Target Coverage Distribution (line chart); TSS Enrichment (table); DNA pitch / Nucleosome pattern (grouped line chart) • https://www.gear-genomics.com/docs/alfred/webapp/#features
  • 29. From BAM to VCF (GATK) • Slop exons by 200 bp • Run the analysis separately for each chromosome
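The two preparation steps on this slide, padding ("slopping") exon intervals by 200 bp and splitting the work by chromosome, can be sketched in plain Python. This is an illustrative sketch only, not the GATK or bedtools implementation; the function name, the BED-style 0-based tuples, and the chromosome sizes are assumptions.

```python
from collections import defaultdict


def slop_and_merge(intervals, pad=200, chrom_sizes=None):
    """Pad (chrom, start, end) intervals by `pad` bp on both sides,
    clamp them to chromosome bounds, and merge overlaps per chromosome.

    Returns a dict mapping each chromosome to its sorted, merged
    intervals, so each chromosome can be processed independently.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        new_start = max(0, start - pad)
        new_end = end + pad
        if chrom_sizes and chrom in chrom_sizes:
            new_end = min(new_end, chrom_sizes[chrom])
        by_chrom[chrom].append((new_start, new_end))

    merged = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        out = [list(ivs[0])]
        for start, end in ivs[1:]:
            if start <= out[-1][1]:  # overlaps or touches the previous interval
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = [tuple(iv) for iv in out]
    return merged


# Hypothetical exon coordinates and chromosome sizes.
exons = [("chr1", 100, 300), ("chr1", 600, 800),
         ("chr1", 5000, 5100), ("chr2", 50, 150)]
padded = slop_and_merge(exons, pad=200,
                        chrom_sizes={"chr1": 5200, "chr2": 1000})
for chrom, ivs in padded.items():
    print(chrom, ivs)
```

Merging after padding avoids calling the same bases twice when neighboring exons end up overlapping, and the per-chromosome dictionary maps naturally onto one calling job per chromosome.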
  • 30. Variant callers: Mutect2, HaplotypeCaller; BreakSeq, LUMPY, Hydra, DELLY, CNVnator, Pindel; FreeBayes/SAMtools, DeepVariant
  • 31. GATK Best Practices for population-based germline variant calling
  • 32. GATK Mutect2 Best Practices for population-based somatic variant calling
  • 33. Discrepancy of variants called by different callers
  • 34. Integrated variant calling • Integration of multiple tools’ results • Isma (integrative somatic mutation analysis) • Ensemble machine learning methods • BAYSIC • SomaticSeq • NeoMutate • SMuRF (Bartha and Gyorffy 2019; Nanni et al. 2019)
  • 35. Quality control in WES: raw data QC, BAM QC, variant QC
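The simplest form of integrating several callers' results is a consensus vote: keep a variant only if enough callers report it. The sketch below shows that idea only. It is not how the tools named above work (they use Bayesian or machine-learning models), and the caller names, variant tuples, and the `min_callers` threshold are hypothetical.

```python
from collections import Counter


def consensus_calls(callsets, min_callers=2):
    """Return variants reported by at least `min_callers` callers.

    `callsets` maps a caller name to a set of
    (chrom, pos, ref, alt) tuples.
    """
    votes = Counter()
    for calls in callsets.values():
        votes.update(set(calls))  # each caller votes once per variant
    return sorted(v for v, n in votes.items() if n >= min_callers)


# Hypothetical call sets from three callers.
callsets = {
    "caller_a": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
    "caller_b": {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "GA")},
    "caller_c": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
}
consensus = consensus_calls(callsets, min_callers=2)
print(consensus)
```

Raising `min_callers` trades sensitivity for precision; the ensemble tools on the slide replace this fixed vote with a learned or probabilistic weighting.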
  • 36. Sample-level variant QC • Tools • GATK CollectVariantCallingMetrics, VCFtools, PLINK/Seq, QC3 • Important QC metrics • Ti/Tv ratio, nonsynonymous/synonymous ratio, heterozygous/nonreference-homozygous (het/nonref-hom) ratio, mean depth • Genotype missing rate • Genotype concordance to related data (different platforms) • Cross-sample DNA contamination (VerifyBamID) • Identity-by-descent (IBD) analysis (PLINK) • Related samples • PCA (EIGENSTRAT) • Population stratum (ethnicity) • Sex check (PLINK)
  • 37. Ti/Tv ratio and het/nonref-hom ratio • The Ti/Tv ratio varies greatly by genome region and functionality, but not by ancestry. • The het/nonref-hom ratio varies greatly by ancestry, but not by genome region and functionality. • Extreme guanine + cytosine content (either high or low) is negatively associated with the magnitude of the Ti/Tv ratio. • When performing QC assessment using these two measures, care must be taken to apply the correct thresholds based on ancestry and genome region. https://academic.oup.com/bioinformatics/article/31/3/318/2366248 • A Ti/Tv ratio that is too low suggests a high false-positive rate; one that is too high suggests bias.
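Both ratios are simple counts over a sample's calls. The following is a minimal sketch, assuming SNVs are given as (ref, alt) base pairs and genotypes as pairs of allele indices (0 = reference); the function names and toy data are illustrative.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snvs):
    """Ratio of transitions to transversions for (ref, alt) SNV pairs."""
    ti = sum((ref, alt) in TRANSITIONS for ref, alt in snvs)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")


def het_nonref_hom_ratio(genotypes):
    """Ratio of heterozygous calls to non-reference homozygous calls.

    Each genotype is a pair of allele indices, with 0 = reference.
    """
    het = sum(1 for a, b in genotypes if a != b)
    hom_alt = sum(1 for a, b in genotypes if a == b and a != 0)
    return het / hom_alt if hom_alt else float("inf")


# Toy data: four transitions and two transversions.
snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
# Toy genotypes: three het, two hom-alt, one hom-ref.
genotypes = [(0, 1), (0, 1), (1, 1), (0, 0), (0, 1), (1, 1)]

titv = titv_ratio(snvs)
het_hom = het_nonref_hom_ratio(genotypes)
print(titv, het_hom)
```

As the slide notes, the value to compare against depends on genome region for Ti/Tv and on ancestry for het/nonref-hom, so no fixed cutoff is hard-coded here.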
  • 39. Potential error sources in next-generation sequencing workflow https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1659-6
  • 40. Origin of variant artifacts • Artifacts introduced by sample/library preparation • Low-quality base calls (read-end artifacts and other low-quality bases) • Alignment artifacts • Local misalignment near indels • Erroneous alignments in low-complexity regions • Paralogous alignments of reads not well represented in the reference • Strand orientation bias artifacts (Strand Orientation Bias Detector (SOBDetector), Fisher score): https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-13-666
  • 41. Artifacts (https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-020-00791-w): low base quality, read end, strand bias, low-complexity misalignment, paralog misalignment
  • 42. Variant-level QC • Important QC metrics • Genotype missing rate • Hardy-Weinberg equilibrium p-value (use with caution) • Mendelian error rate • Allele balance of heterozygous calls • Variant quality score (GATK): filter SNPs and indels separately (https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants) • Hard filter • QualByDepth (QD) • FisherStrand (FS) • StrandOddsRatio (SOR) • RMSMappingQuality (MQ) • MappingQualityRankSumTest (MQRankSum) • ReadPosRankSumTest (ReadPosRankSum) • Example SNP hard filter: --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter" • Machine learning-based filtering: Variant Quality Score Recalibration (VQSR)
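To make the logic of the SNP hard-filter expression on this slide concrete, here is a small Python sketch that applies the same five thresholds to a dictionary of INFO annotations. It is not GATK code. The function name is hypothetical, and skipping annotations that are absent from a record is an assumption of this sketch, not a statement about GATK's behavior.

```python
# Thresholds copied from the slide's SNP filter expression.
# Each entry returns True when the annotation value FAILS the filter.
SNP_HARD_FILTERS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def failed_filters(info):
    """Return the names of the annotations that fail their threshold.

    Annotations missing from `info` are skipped (assumption of this
    sketch). A variant is filtered if the returned list is non-empty,
    mirroring the `||` (logical OR) in the expression.
    """
    return [name for name, fails in SNP_HARD_FILTERS.items()
            if name in info and fails(info[name])]


# Hypothetical annotation values for two SNP records.
good_snp = {"QD": 15.2, "FS": 1.3, "MQ": 60.0,
            "MQRankSum": 0.1, "ReadPosRankSum": 0.4}
bad_snp = {"QD": 1.5, "FS": 75.0, "MQ": 60.0}

print(failed_filters(good_snp))
print(failed_filters(bad_snp))
```

Reporting which annotations failed, instead of a single pass/fail flag, makes it easier to see which threshold drives most of the filtering in a call set.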
  • 43. Variant-level filtering • Tools • GATK, VCFtools, PLINK/Seq • Sequencing data-based filtering • Exclude potential artifacts • Database-based filtering • Exclude known variants that are present in public SNP databases, published studies, or in-house databases, on the assumption that common variants represent harmless variation • Pedigree-based filtering • Each generation introduces up to 4.5 deleterious mutations, so a de novo mutation may well be causing the disease • Function-based filtering • Caution: each filter risks removing the pathogenic variant
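The database-based and pedigree-based filters described above can be combined in a few lines. This is a minimal sketch under stated assumptions: the 1% allele-frequency cutoff, the function names, the treatment of variants absent from the database as rare, and the toy trio genotypes are all illustrative choices, not values from the slides.

```python
def is_rare(variant, population_af, max_af=0.01):
    """Database-based filter: keep variants at or below `max_af`.

    Variants absent from the database are treated as rare
    (assumption of this sketch).
    """
    return population_af.get(variant, 0.0) <= max_af


def is_de_novo(child_gt, father_gt, mother_gt):
    """Pedigree-based filter: the child carries a non-reference allele
    while both parents are homozygous reference.

    Genotypes are tuples of allele indices, with 0 = reference.
    """
    child_has_alt = any(a != 0 for a in child_gt)
    parents_hom_ref = all(a == 0 for a in father_gt + mother_gt)
    return child_has_alt and parents_hom_ref


# Hypothetical population allele frequencies and trio genotypes
# given as (child, father, mother).
population_af = {("chr1", 1000, "A", "G"): 0.25,
                 ("chr2", 500, "C", "T"): 0.0001}
trio_calls = {
    ("chr1", 1000, "A", "G"): ((0, 1), (0, 1), (0, 0)),  # common, inherited
    ("chr2", 500, "C", "T"): ((0, 1), (0, 0), (0, 0)),   # rare, de novo
    ("chr3", 42, "G", "GA"): ((0, 1), (0, 0), (0, 1)),   # novel, inherited
}

candidates = [v for v, (child, father, mother) in trio_calls.items()
              if is_rare(v, population_af)
              and is_de_novo(child, father, mother)]
print(candidates)
```

In line with the caution on the slide, each filter can discard the causal variant, for example an inherited recessive allele, so the filters are best applied as configurable steps and not as a fixed rule.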
  • 45. Allelic balance • SLIVAR (https://github.com/brentp/slivar): filtering on genotype quality, sequencing depth, allele balance, and population allele frequency • https://onlinelibrary.wiley.com/doi/full/10.1002/humu.23674
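Allele balance is the fraction of reads at a site that support the alternate allele; a true heterozygous call is expected to sit near 0.5. The sketch below illustrates the calculation in Python. It is not slivar's implementation, and the function names, the minimum depth of 10, and the 0.2 to 0.8 acceptance window are assumptions chosen for illustration.

```python
def allele_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele, or None
    when the site has no reads."""
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return alt_reads / total


def het_call_passes(ref_reads, alt_reads, min_depth=10, ab_range=(0.2, 0.8)):
    """Accept a heterozygous call only if depth is sufficient and the
    allele balance falls inside `ab_range` (illustrative thresholds)."""
    ab = allele_balance(ref_reads, alt_reads)
    if ab is None or ref_reads + alt_reads < min_depth:
        return False
    return ab_range[0] <= ab <= ab_range[1]


# Hypothetical read counts (ref, alt) for three heterozygous calls.
balanced = het_call_passes(18, 22)    # AB = 0.55, depth 40
skewed = het_call_passes(38, 2)       # AB = 0.05, likely artifact
shallow = het_call_passes(3, 3)       # AB = 0.5 but depth only 6
print(balanced, skewed, shallow)
```

The depth check matters because allele balance is noisy at low coverage: three of six reads gives a perfect 0.5 with very little evidence behind it.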
  • 46. Variant annotation tools • VAT: annotation of variants by functionality in a cloud computing environment
  • 48. Functional predictors/Prioritization tools • SnpSift (http://pcingola.github.io/SnpEff/ss_introduction/) (Hintzsche et al., 2016) • VAAST (https://github.com/Yandell-Lab/VVP-pub) • VarSifter • VarSight • gNome • KGGseq
  • 49. AlphaMissense: a recent, advanced AI tool for inferring the effect of missense mutations (Cheng et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 19 Sep 2023, Vol. 381, Issue 6664, DOI: 10.1126/science.adg7492) (Hintzsche et al., 2016)
  • 50. Tools and resources for linking variants to therapeutics
  • 52. Oncoprint for visualizing cohort variants
  • 55. WES and its data analysis
  • 57. WES data analysis pipelines • DRAGEN (Illumina) • https://www.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/dragen-enrichment.html • JWES
  • 58. WES data analysis pipelines • A high-performance commercial solution (https://www.sentieon.com/products/) • Improves upon BWA, STAR, Minimap2, GATK, HaplotypeCaller, Mutect, and Mutect2 based pipelines and is deployable on any generic-CPU-based computing system