Ngs part iii 2013

Sample & Assay Technologies

Next Generation Sequencing:
Data analysis for genetic profiling
Ravi Vijaya Satya, Ph.D.
Senior Scientist, R&D
Ravi.VijayaSatya@QIAGEN.com


Welcome to the three-part webinar series
Next Generation Sequencing and its role in cancer biology

Webinar 1: Next-generation sequencing, an introduction to technology and applications
Date: April 4, 2013
Speaker: Quan Peng, Ph.D.

Webinar 2: Next-generation sequencing for cancer research
Date: April 11, 2013
Speaker: Vikram Devgan, Ph.D., MBA

Webinar 3: Next-generation sequencing data analysis for genetic profiling
Date: April 18, 2013



Agenda

NGS Data Analysis
Read Mapping
Variant Calling
Variant Annotation
Targeted Enrichment
GeneRead Gene Panels
GeneRead Data Analysis Portal
Background
Workflow
Data Interpretation



Read Mapping
Reads mapped to a reference genome
Millions of reads from a single run
Alignment
Mapping Quality

Programs for read-mapping
Hash-based
MAQ, ELAND, SOAP, Novoalign
Suffix array/Burrows-Wheeler Transform based
BWA, Bowtie, Bowtie2, SOAP2
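
As a rough illustration of the hash-based approach, the sketch below indexes every k-mer of a reference in a dictionary, looks up a read's first k-mer as a seed, and verifies each candidate position by counting mismatches. The function names, the seed length of 8, and the mismatch limit are illustrative assumptions; the programs listed above use far more elaborate schemes.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, then verify by ungapped comparison."""
    hits = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # the read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy reference (0-based positions); a real genome index is built once and reused.
reference = "ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC"
index = build_kmer_index(reference, k=8)
hits = map_read("GATCGTCCTAGATAGTCTCGATAGCTCGAT", reference, index, k=8)
print(hits)  # list of (start position, mismatch count)
```

Seeding on a single k-mer keeps the lookup cheap but misses reads whose seed itself contains an error, which is one reason real mappers use several seeds per read.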




Variant Calling
Determine if there is enough statistical support to call a variant

Reference sequence
ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC
Aligned reads
AACTAGACTAGGATCGTCCTAGATAGTCTCG
AACTAGACTAGGATCGTCCTACATAGTCTCG
AACTAGACTAGGATCGTCCTACATAGTCTCG
GATCGTCCTAGATAGTCTCGATAGCTCGAT
Multiple factors are considered in calling variants
No. of reads with the variant
Mapping qualities of the reads
Base qualities at the variant position
Strand bias (variant is seen in only one of the strands)
Variant Calling Software
GATK Unified Genotyper, Torrent Variant Caller, SamTools, Mutect, …
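
The first factor above, the number of reads carrying the variant, can be sketched with a simple pileup over the reference and reads shown on this slide. The start offsets (0-based) were worked out by hand for this toy example, and the thresholds `min_alt_reads` and `min_alt_fraction` are hypothetical; real callers also weigh mapping quality, base quality, and strand bias.

```python
from collections import Counter

REFERENCE = "ACAGTTAAGCCTGAACTAGACTAGGATCGTCCTAGATAGTCTCGATAGCTCGATATC"

# (0-based start on the reference, read sequence); ungapped alignments assumed.
ALIGNED_READS = [
    (13, "AACTAGACTAGGATCGTCCTAGATAGTCTCG"),
    (13, "AACTAGACTAGGATCGTCCTACATAGTCTCG"),
    (13, "AACTAGACTAGGATCGTCCTACATAGTCTCG"),
    (24, "GATCGTCCTAGATAGTCTCGATAGCTCGAT"),
]


def pileup(reads, reference_length):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in range(reference_length)]
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            columns[start + offset][base] += 1
    return columns


def candidate_variants(reference, reads, min_alt_reads=2, min_alt_fraction=0.2):
    """Return (1-based position, ref, alt, alt reads, depth) for supported mismatches."""
    calls = []
    for position, counts in enumerate(pileup(reads, len(reference))):
        depth = sum(counts.values())
        for base, n in counts.items():
            if base == reference[position]:
                continue
            if n >= min_alt_reads and n / depth >= min_alt_fraction:
                calls.append((position + 1, reference[position], base, n, depth))
    return calls


calls = candidate_variants(REFERENCE, ALIGNED_READS)
print(calls)
```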



Variant Representation
VCF – Variant Call Format

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Header lines
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=OND,Number=1,Type=Float,Description="Overall non-diploid ratio (alleles/(alleles+nonalleles))">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##contig=<ID=chrM,length=16571,assembly=hg19>
##contig=<ID=chr1,length=249250621,assembly=hg19>

Column labels
#CHROM  POS       ID          REF  ALT  QUAL   FILTER  INFO              FORMAT    Sample

Variant calls
chr1    11181327  rs11121691  C    T    100.0  PASS    DP=1000;MQ=87.67  GT:AD:DP  0/1:146,45:191
chr1    11190646  rs2275527   G    A    100.0  PASS    DP=1000;MQ=67.38  GT:AD:DP  0/1:462,121:583
chr1    11205058  rs1057079   C    T    100.0  PASS    DP=1000;MQ=79.57  GT:AD:DP  0/1:49,143:192
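
A VCF data line can be split into its fixed columns with a few lines of code. The sketch below is a minimal parser for single-sample records like the ones on this slide; the function names are illustrative, real VCF files are tab-delimited, and a maintained VCF library would normally be preferable to hand-rolled parsing.

```python
def parse_vcf_record(line):
    """Split one tab-delimited, single-sample VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info, fmt, sample = fields[:10]
    info_fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_fields[key] = value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt,
        "qual": float(qual),
        "filter": filt,
        "info": info_fields,
        # Pair each FORMAT key with the matching per-sample value.
        "sample": dict(zip(fmt.split(":"), sample.split(":"))),
    }


def alt_allele_fraction(record):
    """Fraction of reads supporting the ALT allele, from the AD (allelic depth) field."""
    ref_reads, alt_reads = (int(n) for n in record["sample"]["AD"].split(","))
    return alt_reads / (ref_reads + alt_reads)


line = "chr1\t11181327\trs11121691\tC\tT\t100.0\tPASS\tDP=1000;MQ=87.67\tGT:AD:DP\t0/1:146,45:191"
record = parse_vcf_record(line)
print(record["sample"]["GT"], round(alt_allele_fraction(record), 3))
```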



Variant Annotation


Chrom  Pos        ID           Ref  Alt  Gene Name  Mutation Type  Codon Change  AA Change  Allele Frequency  snpEff Effect          Filtered Coverage  Variant Frequency
chr1   11181327   rs11121691   C    T    MTOR       SNP            c.6909C>T     p.L2303    C=0.761 T=0.239   SYNONYMOUS_CODING      1,924              0.239
chr1   11190646   rs2275527    G    A    MTOR       SNP            c.5553G>A     p.S1851    G=0.791 A=0.208   SYNONYMOUS_CODING      5,842              0.208
chr1   11205058   rs1057079    C    T    MTOR       SNP            c.4731C>T     p.A1577    C=0.254 T=0.746   SYNONYMOUS_CODING      1,928              0.746
chr1   11288758   rs1064261    G    A    MTOR       SNP            c.2997G>A     p.N999     G=0.212 A=0.788   SYNONYMOUS_CODING      5,186              0.788
chr1   11300344   rs191073707  C    T    MTOR       SNP            -             -          C=0.924 T=0.076   INTRON                 210                0.076
chr1   11301714   rs1135172    A    G    MTOR       SNP            c.1437A>G     p.D479     A=0.248 G=0.752   SYNONYMOUS_CODING      3,965              0.752
chr1   11322628   rs2295080    G    T    MTOR       SNP            -             -          G=0.239 T=0.755   UPSTREAM               339                0.755
chr1   186641626  rs2853805    G    A    PTGS2      SNP            -             -          G=0.0 A=1.0       UTR_3_PRIME            97                 1
chr1   186642429  rs2206593    A    G    PTGS2      SNP            -             -          A=0.167 G=0.833   UTR_3_PRIME            3,552              0.833
chr1   186643058  rs5275       A    G    PTGS2      SNP            -             -          A=0.759 G=0.241   UTR_3_PRIME            237                0.241
chr1   186645927  rs2066826    C    T    PTGS2      SNP            -             -          C=0.88 T=0.12     INTRON                 209                0.12
chr2   29415792   rs1728828    G    A    ALK        SNP            -             -          G=0.0 A=1.0       UTR_3_PRIME            2,520              1
chr2   29416366   rs1881421    G    C    ALK        SNP            c.4587G>C     p.D1529E   G=0.907 C=0.093   NON_SYNONYMOUS_CODING  4,361              0.093
chr2   29416481   rs1881420    T    C    ALK        SNP            c.4472T>C     p.K1491R   T=0.954 C=0.045   NON_SYNONYMOUS_CODING  3,061              0.045
chr2   29416572   rs1670283    T    C    ALK        SNP            c.4381T>C     p.I1461V   T=0.0 C=0.999     NON_SYNONYMOUS_CODING  5,834              0.999
chr2   29419591   rs1670284    G    T    ALK        SNP            -             -          G=0.093 T=0.907   INTRON                 739                0.907
chr2   29445458   rs3795850    G    T    ALK        SNP            c.3375G>T     p.G1125    G=0.917 T=0.082   SYNONYMOUS_CODING      1,776              0.082
chr2   29446184   rs2276550    C    G    ALK        SNP            -             -          C=0.895 G=0.105   INTRON                 475                0.105

ID: dbSNP/COSMIC ID
Codon Change, AA Change: actual change and position within the codon or amino acid sequence
snpEff Effect: effect of the variant on protein coding

SIFT score
Predicts the deleterious effect of an amino acid change based on how conserved the
sequence is among related species
PolyPhen score
Predicts the impact of the variant on protein structure
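
Once variants carry annotations such as the snpEff effect, selecting the ones that change the protein is a simple filter. The sketch below uses a handful of rows copied from the annotation table on this slide; the dictionary layout, the function name, and the 0.05 frequency cutoff are hypothetical choices made for illustration.

```python
# A few annotated variants from the slide's table (ID, gene, effect, AA change, frequency).
ANNOTATED_VARIANTS = [
    {"id": "rs11121691", "gene": "MTOR", "effect": "SYNONYMOUS_CODING",
     "aa_change": "p.L2303", "frequency": 0.239},
    {"id": "rs1881421", "gene": "ALK", "effect": "NON_SYNONYMOUS_CODING",
     "aa_change": "p.D1529E", "frequency": 0.093},
    {"id": "rs1881420", "gene": "ALK", "effect": "NON_SYNONYMOUS_CODING",
     "aa_change": "p.K1491R", "frequency": 0.045},
    {"id": "rs1670283", "gene": "ALK", "effect": "NON_SYNONYMOUS_CODING",
     "aa_change": "p.I1461V", "frequency": 0.999},
    {"id": "rs1670284", "gene": "ALK", "effect": "INTRON",
     "aa_change": "", "frequency": 0.907},
]


def protein_changing(variants, min_frequency=0.05):
    """Keep non-synonymous coding variants at or above a frequency cutoff."""
    return [
        v for v in variants
        if v["effect"] == "NON_SYNONYMOUS_CODING" and v["frequency"] >= min_frequency
    ]


selected = protein_changing(ANNOTATED_VARIANTS)
for variant in selected:
    print(variant["gene"], variant["id"], variant["aa_change"], variant["frequency"])
```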



GeneRead DNAseq Gene Panel: Targeted Sequencing

What is targeted sequencing?
Sequencing a subset of the regions in the whole genome

Why do we need targeted sequencing?
Not all regions in the genome are of interest or relevant to a specific study
Exome Sequencing: sequencing most of the exonic regions of the genome (exome).
Protein-coding regions constitute less than 2% of the entire genome
Focused panel/hot spot sequencing: focused on the genes or regions of interest

What are the advantages of focused panel sequencing?
More coverage per sample, more sensitive mutation detection
More samples per run, lower cost per sample




Target Enrichment - Methodology

Multiplex PCR
Small DNA input (< 100 ng)
Short processing time (several hrs)
Relatively small throughput (KB - MB region)

Workflow: Sample preparation (DNA isolation) -> PCR target enrichment (2 hours) -> Library construction -> Sequencing -> Data analysis



Variants Identifiable through Multiplex PCR

SNPs – single nucleotide polymorphisms
Indels
Indels < 20 bp in length
Variants not callable
Structural variants
Large indels
Inversions
Copy Number Variants (CNVs)

[Figure: examples of a large insertion, an inversion, and a CNV]




GeneRead DNAseq Gene Panel

Multiplex PCR technology-based targeted enrichment for DNA sequencing
Covers all human exons (coding region + UTR)
Gene primer sets are divided into 4 tubes; up to 1,200-plex in each tube


GeneRead DNAseq Gene Panel
Focus on your Disease of Interest
Comprehensive Cancer Panel (124 genes)

Disease Focused Gene Panels (20 genes)
Breast cancer
Colon cancer
Gastric cancer
Leukemia
Liver cancer
Lung cancer
Ovarian cancer
Prostate cancer

[Figure labels: Genes Involved in Disease; Genes with High Relevance]


GeneRead DNAseq Custom Panel



Data Analysis for Targeted Sequencing
GeneRead data analysis workflow

Read Mapping -> Primer Trimming -> Variant Calling -> Variant Annotation

Read mapping
Identify the possible position of the read within the reference
Align the read sequence to reference sequences
Primer trimming
Remove primer sequences from the reads
Variant calling
Identify differences between the reference and reads
Variant filtering and annotation
Functional information about the variant




Reads from Targeted Sequencing

Typical NGS raw read from targeted sequencing
5’- Adapter | Barcode | Primer | Insert sequence | Primer | Adapter -3’

After removal of adapters and de-multiplexing
5’- Primer | Insert sequence | Primer -3’

Read length can vary: only part of the insert or the 3’ primer may be present


Read Mapping
Align reads to the reference genome

Reference sequence

Amplicon 1

Amplicon 2

Aligned reads



Primer Trimming


Primer sequences must be trimmed for accurate variant calling
[Figure: reads from Amplicon 1 and Amplicon 2 aligned to the reference sequence; four reads carry a `C` at the same position]
Frequency of `C` without primer trimming = 4/13 = 31%
Frequency of `C` after primer trimming = 4/7 = 57%
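
The arithmetic on this slide can be reproduced with a small sketch. It assumes that the six observations removed by trimming are primer-derived bases, which always show the primer's own sequence rather than the sample's; the reference base `T` and the data layout are hypothetical, and only the counts (4 of 13, then 4 of 7) come from the slide.

```python
def allele_frequency(bases, allele):
    """Fraction of observed bases at one position that match the given allele."""
    return sum(1 for base in bases if base == allele) / len(bases)


# Each observation at the variant position: (base, came_from_primer_sequence).
observations = (
    [("C", False)] * 4    # insert-derived reads carrying the variant
    + [("T", False)] * 3  # insert-derived reads carrying the reference base
    + [("T", True)] * 6   # primer-derived bases: these copy the primer, not the sample
)

untrimmed = [base for base, _ in observations]
trimmed = [base for base, from_primer in observations if not from_primer]

before = allele_frequency(untrimmed, "C")
after = allele_frequency(trimmed, "C")
print(f"without trimming: {before:.0%}, after trimming: {after:.0%}")
```

Leaving primer bases in place dilutes the variant, which matters most for low-frequency somatic calls near a frequency cutoff.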



GeneRead Variant Calling Overview

Inputs
BAM file (w/ flow space info)
BED file for amplicons used
Run parameters

Variant calling and filtering (path depends on the sequencing platform)
Ion Torrent: Torrent Variant Caller (TVC) -> vcf -> GATK Variant Annotator -> vcf
MiSeq: GATK Indel Realigner -> bam -> GATK Base Quality Recalibrator -> bam -> GATK Unified Genotyper -> vcf -> GATK Variant Filtration -> vcf
Additional filtering (based on frequency and coverage)

Annotation
snpEff (basic annotation) -> vcf
SnpSift (links to dbSNP and COSMIC, computation of SIFT scores, etc.; uses dbSNP, COSMIC, and dbNSFP) -> vcf
VCF to Excel -> variants in Excel format


Indel realignment

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May; 43(5):491-8. PMID: 21478889

Eliminates some false-positive variant calls around indels
Read aligners cannot eliminate these alignment errors because they align each read independently
Multiple sequence alignment can identify these errors and correct them


Base Quality Recalibration

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May; 43(5):491-8. PMID: 21478889

Eliminates sequencer-specific biases
Lane-specific/sample-specific biases
Instrument-specific under-reporting/over-reporting of quality scores
Systematic errors based on read position
Di-nucleotide-specific sequencing errors
Recalibration leads to improved variant calls
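
The idea behind recalibration can be sketched with the Phred scale, where a quality score Q corresponds to an error probability of 10^(-Q/10). The example compares the quality an instrument reports with the quality observed empirically in one group of bases. The function names, the +1/+2 smoothing, and the counts are illustrative assumptions; this is not the GATK implementation, which also groups bases by covariates such as read position and dinucleotide context.

```python
import math


def phred(error_probability):
    """Convert an error probability into a Phred-scaled quality score."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Convert a Phred-scaled quality score back into an error probability."""
    return 10 ** (-quality / 10)


def empirical_quality(mismatches, observations):
    """Observed quality of a group of bases, smoothed so zero mismatches stays finite."""
    return phred((mismatches + 1) / (observations + 2))


# Hypothetical bin: bases reported as Q30, but 99 of 9,998 disagree with the
# reference at positions not known to be polymorphic.
reported = 30
observed = empirical_quality(mismatches=99, observations=9998)
print(f"reported Q{reported}, empirical Q{observed:.1f}")
```

In this made-up bin the instrument over-reports quality by about 10 Phred units, so a recalibrator would lower the scores of bases in that group.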




Variant Filtration

Variant Frequency
Somatic mode
SNPs with frequency < 4% and indels with frequency < 20%
Germline mode
SNPs with frequency < 20% and indels with frequency < 25%
Strand Bias (variants that are present in reads from only one of the two strands)
SNPs with FS ≤ 60
Indels with FS ≤ 200
Mapping Quality
SNPs with MQ ≤ 40.0
Haplotype Score
SNPs with HaplotypeScore ≤ 13.0
Not applicable for pooled samples
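
The thresholds above can be combined into a single pass/fail check. The slide does not state for each condition whether it marks variants that are kept or removed, so this sketch makes explicit assumptions: variants below the frequency cutoff are removed, variants with FS above the cutoff are removed, SNPs with MQ below 40 are removed, and SNPs with HaplotypeScore above 13 are removed unless the sample is pooled. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    kind: str               # "SNP" or "INDEL"
    frequency: float        # variant allele frequency, between 0 and 1
    fs: float               # Phred-scaled strand-bias p-value
    mq: float               # RMS mapping quality
    haplotype_score: float


# Assumed reading of the slide: minimum frequency to keep a variant, per mode.
MIN_FREQUENCY = {
    "somatic": {"SNP": 0.04, "INDEL": 0.20},
    "germline": {"SNP": 0.20, "INDEL": 0.25},
}
# Assumed reading of the slide: maximum tolerated strand-bias score.
MAX_FS = {"SNP": 60.0, "INDEL": 200.0}


def passes_filters(variant, mode, pooled=False):
    """Return True if the variant survives all filters under the stated assumptions."""
    if variant.frequency < MIN_FREQUENCY[mode][variant.kind]:
        return False
    if variant.fs > MAX_FS[variant.kind]:
        return False
    if variant.kind == "SNP":
        if variant.mq < 40.0:
            return False
        if not pooled and variant.haplotype_score > 13.0:
            return False
    return True


snp = Variant(kind="SNP", frequency=0.05, fs=10.0, mq=60.0, haplotype_score=5.0)
print(passes_filters(snp, "somatic"), passes_filters(snp, "germline"))
```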



Specificity Analysis

Specificity: the percentage of sequences that map to the intended target regions of interest (ROI)
Specificity = number of on-target reads / total number of reads

[Figure: NGS reads aligned to a reference sequence containing ROI 1 and ROI 2; on-target reads fall on the ROIs, off-target reads fall outside them]
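
A minimal sketch of the specificity calculation follows. It assumes reads and regions of interest are given as half-open (start, end) intervals on one reference sequence, and that a read counts as on-target if it overlaps an ROI by at least one base; the slide does not define the overlap rule, and the coordinates are made up.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True if two half-open intervals share at least one position."""
    return start_a < end_b and start_b < end_a


def specificity(reads, rois):
    """Number of on-target reads divided by the total number of reads."""
    if not reads:
        return 0.0
    on_target = sum(
        1
        for read_start, read_end in reads
        if any(overlaps(read_start, read_end, roi_start, roi_end)
               for roi_start, roi_end in rois)
    )
    return on_target / len(reads)


rois = [(100, 200), (400, 500)]                          # ROI 1 and ROI 2
reads = [(90, 150), (120, 180), (250, 300), (420, 480)]  # the third read is off-target
result = specificity(reads, rois)
print(f"specificity = {result:.0%}")
```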


Sequencing Depth

Coverage depth (or depth of coverage): how many times each base has been sequenced
Unlike Sanger sequencing, in which each sample is sequenced 1-3 times to be confident of its nucleotide identity, NGS generally needs to cover each position many times to make a confident base call, because of its relatively high error rate (0.1 - 1% vs 0.001 - 0.01%)
Increasing coverage depth also helps to identify low-frequency mutations in heterogeneous samples such as cancer samples

[Figure: NGS reads aligned to a reference sequence, with regions of coverage depth = 4, 3, and 2]
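
Per-base coverage depth can be computed from read alignments with a difference array: add one where a read starts, subtract one where it ends, then take a running sum. The sketch assumes half-open (start, end) intervals that lie within the region; the coordinates are invented for illustration.

```python
def coverage_depth(read_intervals, region_length):
    """Return the number of reads covering each position of the region."""
    changes = [0] * (region_length + 1)
    for start, end in read_intervals:
        changes[start] += 1  # a read begins covering at `start`
        changes[end] -= 1    # and stops covering at `end` (exclusive)
    depth = []
    running = 0
    for change in changes[:region_length]:
        running += change
        depth.append(running)
    return depth


reads = [(0, 6), (2, 8), (4, 10), (5, 10)]
depth = coverage_depth(reads, region_length=10)
print(depth)
```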


NGS Data Analysis: Uniformity

Coverage uniformity: a measure of the evenness of coverage depth across target positions

[Figure: NGS reads aligned to a reference sequence, with regions of coverage depth = 10, 3, and 2]
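
The slide does not give a formula for uniformity. One common convention, assumed here purely for illustration, is the fraction of target positions whose depth is at least 20% of the mean depth; the function name, the 0.2 cutoff, and the depth values are all hypothetical.

```python
def uniformity(depths, fraction_of_mean=0.2):
    """Fraction of positions covered at or above a set fraction of the mean depth."""
    if not depths:
        return 0.0
    threshold = fraction_of_mean * (sum(depths) / len(depths))
    return sum(1 for depth in depths if depth >= threshold) / len(depths)


# Ten target positions: six well covered, two moderately, two poorly.
depths = [10, 10, 10, 10, 10, 10, 3, 3, 1, 1]
score = uniformity(depths)
print(f"uniformity = {score:.0%}")
```

A perfectly even panel scores 100% under this definition, and positions that fall far below the mean pull the score down.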




GeneRead Data Analysis Web Portal

FREE Complete & Easy to use Data Analysis with Web-based Software





Note: Runtimes depend on the number of reads in the input files. Typical runtimes are 20-60 minutes.




Summary

Run Summary
Specificity
Coverage
Uniformity
Numbers of SNPs and Indels

Summary By Gene
Specificity
Coverage
Uniformity
# of SNPs and Indels



Features of Variant Report

SNP detection
Indel detection



QIAGEN’s GeneRead DNAseq Gene Panel System
FOCUS ON YOUR RELEVANT GENES
Focused: Biologically relevant content selection enables deep sequencing on relevant genes and identification of rare mutations
Flexible: Mix and match any gene of interest
NGS platform independent: Functionally validated for PGM, MiSeq/HiSeq
Integrated controls: Enabling quality control of prepared library before sequencing
Free, complete, and easy-to-use data analysis tool


Upcoming webinars
Next Generation Sequencing and its role in cancer biology

Webinar 3: Next-generation sequencing data analysis for genetic profiling
Date: April 18, 2013

Webinar 1: Next-generation sequencing, an introduction to technology and applications
Date: May 3, 2013
Speaker: Quan Peng, Ph.D.

Webinar 2: Next-generation sequencing for cancer research
Date: May 10, 2013
Speaker: Vikram Devgan, Ph.D., MBA
