Characterizing Alzheimer’s Disease candidate genes and transcripts with targeted, long-read, single-molecule sequencing

IDT and PacBio joint presentation—Characterizing
Alzheimer’s Disease candidate genes and transcripts
with targeted, long-read, single-molecule sequencing
Jenny Gu, PhD
Strategic Business Development Manager, PacBio

For Research Use Only. Not for use in diagnostics procedures. © Copyright 2017 by Pacific Biosciences of California, Inc. All rights reserved.
September 27, 2017 / Jenny Gu, Ph.D.

AGENDA
-SMRT Sequencing technology overview
-Recommended IDT capture workflow for SMRT Sequencing
-Case Study: Alzheimer’s Disease panel

ALZHEIMER’S DISEASE (AD)
Alzheimer’s disease is the most common form of neurodegenerative dementia.
Clinical characterization:
Progressive loss of memory and deficits in thinking, problem solving, and language
Neuropathological characterization:
Progressive cortical atrophy due to neuronal loss and characteristic intracellular and extracellular deposits of insoluble tau and amyloid β proteins
[Figure: 46.8M people living with dementia worldwide in 2015, projected to reach 131.5M by 2050. Source: https://www.alz.co.uk/research/WorldAlzheimerReport2015.pdf]
[Image: http://www.reverseagingcentre.com/media/links/signs-of-alzheimers/]

ALZHEIMER’S DISEASE (AD)
-Genetically divided into two different groups: early-onset and late-onset
-Relative risk for first degree relatives is 3.5 – 7.5
-30 – 48% of AD patients have an affected first-degree relative
Late-onset AD:
- Manifests after 65 years
- Multifactorial with strong genetic predisposition
- GWAS have identified 20+ genetic risk loci with small odds ratios (1.1 – 2.0 per risk allele), including both common functional variants and rare and structural variants
Early-onset AD:
- For 2 – 10% of patients first symptoms occur in their 20s or 30s.
- Four genes account for 5 – 10% of early-onset AD: APP, PSEN1, PSEN2, APOE
The complex genetic makeup of AD

CANDIDATE DISEASE GENES IN ALZHEIMER’S DISEASE (AD)
Many associated genetic loci contain several genes
Which candidates are involved in disease risk remains unclear (20+ genes)
Strategies for assessing GWAS candidate genes:
-DNA sequencing
-Transcriptome sequencing
-Proteome studies
-Methylome studies
Cuyvers E. et al. (2016) Genetic variations underlying Alzheimer's disease: evidence from genome-wide association studies and beyond. Lancet Neurol. 15(8),857-68.
Decades-long search for risk genes in Alzheimer’s disease

SEQUEL SYSTEM
Typical Performance
-Average read length: 10 – 18 kb
-Consensus accuracy: Achieves QV50
-Throughput per cell: 5 – 8 Gb
-SMRT Cells per run: 1 – 16
-Movie lengths: 30 minutes – 10 hours

TYPICAL DATA
Read lengths >20 kb
Data per SMRT Cell: 5 – 8 Gb
Half of data in reads >20 kb
Top 5% of reads >35 kb
Maximum read lengths >60 kb
[Figure: number of reads vs. read length (bp)]
Read length data shown from a 30 kb size-selected human library on the Sequel System (10-hour movie, 2.0 chemistry) with a total output of 7.6 Gb. Each Sequel System SMRT Cell 1M generates ~365,000 reads.
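The claim that half of the data lies in reads >20 kb is a read-length N50 style summary. The sketch below shows one way such a number could be computed from a list of read lengths. The function names and the toy lengths are illustrative assumptions; they are not PacBio tooling and not data from this run.

```python
from typing import Sequence


def read_length_n50(lengths: Sequence[int]) -> int:
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no read lengths given")
    total_bases = sum(lengths)
    running = 0
    # Walk from the longest read down, accumulating bases until half the total is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total_bases:
            return length
    raise AssertionError("unreachable for non-empty input")


def fraction_of_bases_above(lengths: Sequence[int], threshold: int) -> float:
    """Fraction of all sequenced bases that sit in reads longer than `threshold`."""
    total_bases = sum(lengths)
    return sum(n for n in lengths if n > threshold) / total_bases


if __name__ == "__main__":
    # Toy read lengths in bp (hypothetical, for illustration only).
    toy = [5_000, 10_000, 20_000, 25_000, 40_000]
    print("N50:", read_length_n50(toy))
    print("Fraction of bases in reads >20 kb:", fraction_of_bases_above(toy, 20_000))
```

In practice the lengths would come from the subread or CCS output rather than a hand-written list.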

BENEFITS OF LONG-READ SEQUENCING FOR CHARACTERIZING GENOMIC STRUCTURAL VARIATION
Structural variation (SV) is an important contributor to human diversity and disease
SV is also difficult to characterize
[Figure: example SV types and mechanisms. Source: Carvalho CM et al. (2016) Mechanisms underlying structural variant formation in genomic disorders. Nat Rev Genet.]
Targeted SMRT Sequencing allows scientists to directly characterize:
• Complete Genes (introns & exons)
• Phased Variants (allelic haplotypes)
• Repetitive Regions
• Regulatory Regions (upstream/downstream)
• Insertions & Deletions
• Copy Number Variations
At high coverage for specific genes or regions of interest across multiple samples.

GENETIC VARIATION SEQUENCING WITH SMRT SEQUENCING
[Figure: variant types plotted against size of variant, on a scale from 1 bp to 100 Mb]
Variant types shown: SNPs; small indels; STRs & VNTRs; large insertions and deletions; mobile elements (L1, Alu, SVA); copy number variation; inversions / translocations; complex variants; phasing of SVs and SNVs
One PacBio read spans most variants: indels, repeat expansions, structural variants, phasing (SNVs and SVs)
Assembled PacBio reads span euchromatic genome variation: medium to large SVs, phased alleles and haplotypes, haplotype reconstruction, large structural rearrangements

ADDITIONALLY CHARACTERIZE TRANSCRIPTOME SPLICE VARIATION WITH LONG-READ SEQUENCING
- Proteins and their functions are not only impacted by variants in exonic regions
- Variants in regulatory regions (enhancers/promoters, including methylation) and intronic regions can also play an important role
- High transcript isoform diversity from alternative splicing
- Obtain full-length transcript sequences with Iso-Seq analysis
National Human Genome Research Institute. Bioinformatics: Finding genes. (2013) http://www.genome.gov/25020001

TRACE VARIANTS TO SPECIFIC ALLELES WITH PHASED HETEROZYGOUS SNPS

CASE STUDY: VARIANT SCREENING IN ALZHEIMER’S DISEASE WITH LONG-READ SEQUENCING
-Genomic and transcriptomic (cDNA) capture experiment
-Combined data provide better insight on variant-affected gene expression
-Gene panel applied to two AD patients (35 candidate genes):
• Average gDNA fragment size: ~6 kb
• Full-length transcripts ranging from <1 kb to ~10 kb

PACBIO TARGETED PROBE-BASED CAPTURE WORKFLOW
(GENOMIC DNA CAPTURE)
[Workflow diagram]
EXPERIMENTAL PIPELINE: genomic DNA; shear to 7 kb (6 kb for multiplex); size selection (5-9 kb); ligate barcoded adapters; amplification; probe hybridization, bead capture, wash; amplification and SMRTbell prep. + size selection (5-9 kb); sequencing; analysis
INFORMATICS PIPELINE: map reads of insert to reference; phasing with SAMtools; bin reads by haplotype; phased allelic consensus sequence; tertiary analysis

BEST PRACTICE SUMMARY: GENOMIC CAPTURE
-Save on project costs by multiplexing and by spacing probes up to 1 kb apart.
-Multiplex up to 12 samples.
-Use PacBio linear barcoded adapters.
-High molecular weight DNA is required.
-Size selection is highly recommended to maximize long-read recovery.
-Aim for 100-fold coverage of the targeted panel size (full-length gene coverage).
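The 100-fold coverage target can be turned into a rough sequencing budget. The sketch below is back-of-the-envelope arithmetic only: the panel size, the on-target fraction of 0.5, and the function name are hypothetical assumptions, and the default yield of 5 Gb per SMRT Cell is the low end of the typical Sequel throughput quoted earlier in this deck.

```python
import math


def smrt_cells_needed(
    panel_bp: int,
    n_samples: int,
    fold_coverage: int = 100,
    on_target_fraction: float = 0.5,         # ASSUMPTION: placeholder, measure for your own panel
    yield_per_cell_bp: int = 5_000_000_000,  # low end of the 5 - 8 Gb per cell quoted in the deck
) -> int:
    """Estimate how many SMRT Cells are needed to reach the target coverage for all samples."""
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    # Bases that must land on the panel across all multiplexed samples.
    on_target_bases = panel_bp * fold_coverage * n_samples
    # Inflate by the expected off-target loss to get total bases to sequence.
    total_bases = on_target_bases / on_target_fraction
    return math.ceil(total_bases / yield_per_cell_bp)


if __name__ == "__main__":
    # Hypothetical 2 Mb panel, 12 multiplexed samples, 100-fold coverage each.
    print(smrt_cells_needed(panel_bp=2_000_000, n_samples=12))
```

The estimate ignores PCR duplicates and uneven coverage across targets, so it is best treated as a lower bound.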

AD SAMPLES: SHEARED GDNA QC
[Figure: sheared gDNA QC, 10 kb shear]
Recommend starting with HMW gDNA (2 µg)

SMRTBELL LIBRARY QC (SIZE-SELECTED)
[Figure: final library, size selected]

GRCH38 SUBREAD MAPPING RESULTS
Skeletal muscle: 7.4 GB, 2.2 M reads
Brain: 8.4 GB, 2.5 M reads

PACBIO TARGETED PROBE-BASED CAPTURE WORKFLOW
(TRANSCRIPTOME WITH SIZE SELECTION)
[Workflow diagram]
EXPERIMENTAL PIPELINE: mRNA; cDNA library + barcodes; amplification; size selection (optional, 5-9 kb); probe hybridization, bead capture, wash; amplification and SMRTbell prep.; sequencing; analysis
INFORMATICS PIPELINE: Iso-Seq analysis; tertiary analysis

BEST PRACTICE SUMMARY: CDNA CAPTURE
-Recover high-quality RNA transcripts.
-Size selection is optional, but helpful for specific fractions.
-Targeted capture Iso-Seq analysis is recommended to characterize splice isoforms.
-Not recommended for characterizing gene expression levels.
-Aim for a minimum of 30-fold coverage per anticipated splice isoform in each sample.
-Probes can be designed to exons only and/or to include introns.

AD SAMPLES: MRNA QC
Temporal lobe 1 RNA: RIN = 8.0
Temporal lobe 2 RNA: RIN = 8.1
Recommend RIN > 6 (RIN = RNA Integrity Number)

EXAMPLE WHOLE TRANSCRIPTOME SMRTBELL LIBRARY (CDNA)

DESIGNING CUSTOM IDT XGEN® LOCKDOWN® CAPTURE PANEL
-Key benefit of xGen® Lockdown® Probes is flexibility in design
-No need to redesign existing probe panels
-However, full-gene design is recommended: include introns and exons, plus extra upstream and downstream sequences
-Probes can be spaced up to 1000 bp apart
-Use the same probes for genomic and cDNA capture
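To make the spacing rule concrete, here is a minimal sketch of tiling probes across a full-gene target with flanking sequence. It rests on several assumptions that are not stated in the slides: a probe length of 120 nt, reading "1000 bp apart" as the maximum gap between the end of one probe and the start of the next, BED-style 0-based half-open coordinates, and hypothetical gene coordinates. It is not IDT's design tool.

```python
from typing import List, Tuple


def tile_probes(
    start: int,
    end: int,
    probe_len: int = 120,   # ASSUMPTION: probe length, check your panel specification
    max_gap: int = 1000,    # ASSUMPTION: "spaced up to 1000 bp apart" read as gap between probes
    flank: int = 0,         # extra upstream/downstream sequence to include
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) probe intervals covering the gene plus flanks."""
    region_start = max(0, start - flank)
    region_end = end + flank
    if region_end - region_start < probe_len:
        raise ValueError("region is shorter than one probe")

    probes: List[Tuple[int, int]] = []
    pos = region_start
    # Place probes at the widest allowed spacing while they still fit in the region.
    while pos + probe_len <= region_end:
        probes.append((pos, pos + probe_len))
        pos += probe_len + max_gap
    # If the tail of the region is further than max_gap from the last probe,
    # add one more probe flush with the region end.
    if region_end - probes[-1][1] > max_gap:
        probes.append((region_end - probe_len, region_end))
    return probes


if __name__ == "__main__":
    # Hypothetical gene at 5,000-15,000 with 2 kb of flanking sequence on each side.
    for probe_start, probe_end in tile_probes(5_000, 15_000, flank=2_000):
        print(probe_start, probe_end)
```

A real design would also account for repeats and GC content, which this sketch deliberately ignores.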
[Figure: full-gene design for Gene A and Gene B]

SNPs AND LARGER SVs DISCOVERED IN AD SAMPLES
STUDY RESULTS:
Detected a broad range of genomic variants (SNPs and SVs):
-31 unique SVs ranging from 65 bp to several kb in size
500+ isoforms found in each patient
-Patient 1: 515 isoforms
-Patient 2: 507 isoforms
88% novel splice isoforms identified
-Only 39 isoforms shared among both patients and those reported in Gencode v25
[Figure: Venn diagram of isoforms (Patient 1, Patient 2, Gencode v25); region counts: 67, 2, 3, 39, 319, 154, 312]
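The Venn diagram counts on this slide can be cross-checked against the stated totals. The sketch below assigns each count to a Venn region; that assignment is an assumption inferred from the numbers, since the figure itself did not survive extraction. It is the assignment that reproduces 515 and 507 isoforms per patient and 39 isoforms shared by all three sets. Reading the 88% figure as a share of all isoforms in the diagram, Gencode-only ones included, is likewise an interpretation and not something the slide states.

```python
# Region counts from the slide; the mapping of counts to regions is an ASSUMPTION
# inferred from the per-patient totals (515 and 507) and the shared count (39).
venn = {
    ("P1",): 319,
    ("P2",): 312,
    ("P1", "P2"): 154,
    ("P1", "Gencode"): 3,
    ("P2", "Gencode"): 2,
    ("P1", "P2", "Gencode"): 39,
    ("Gencode",): 67,
}


def set_total(name: str) -> int:
    """Sum every Venn region that belongs to the named set."""
    return sum(count for regions, count in venn.items() if name in regions)


# Isoforms seen in at least one patient but absent from Gencode v25.
novel = sum(count for regions, count in venn.items() if "Gencode" not in regions)
all_isoforms = sum(venn.values())
novel_fraction = novel / all_isoforms

if __name__ == "__main__":
    print("Patient 1 isoforms:", set_total("P1"))
    print("Patient 2 isoforms:", set_total("P2"))
    print(f"Novel fraction: {novel_fraction:.1%}")
```

Under this reading the per-patient totals match the slide and the novel fraction rounds to 88%.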

RIN3 GENE: ~50 bp INSERTION DETECTED

ZCWPW1 GENE: ~750 bp DELETION DETECTED IN BOTH PATIENTS
[Figure panels: Patient 1, Patient 2]

BACE1 GENE: PHASED ALLELES (34 KB)
Heterozygous SNPs can be used to phase alleles across multi-kilobase regions
[Figure tracks: Gene, Probes, Target, Phased SNPs, Phase 0, Phase 1]
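The idea of using heterozygous SNPs to sort reads into two alleles can be illustrated with a toy example. The sketch below assumes the two haplotypes (phase 0 and phase 1) are already known at each heterozygous site and assigns every read by majority vote. The data structures, positions, and read names are hypothetical. This is not how SAMtools infers phase; it only illustrates the binning step that follows phasing.

```python
from typing import Dict, List, Optional

Alleles = Dict[int, str]  # heterozygous SNP position -> base


def assign_phase(read: Alleles, phase0: Alleles, phase1: Alleles) -> Optional[int]:
    """Return 0 or 1 for the haplotype the read matches best, or None if undecided."""
    votes0 = sum(1 for pos, base in read.items() if phase0.get(pos) == base)
    votes1 = sum(1 for pos, base in read.items() if phase1.get(pos) == base)
    if votes0 == votes1:
        return None  # no informative SNPs, or conflicting evidence
    return 0 if votes0 > votes1 else 1


def bin_reads(
    reads: Dict[str, Alleles], phase0: Alleles, phase1: Alleles
) -> Dict[Optional[int], List[str]]:
    """Group read names by assigned haplotype; unassigned reads go under None."""
    bins: Dict[Optional[int], List[str]] = {0: [], 1: [], None: []}
    for name, alleles in reads.items():
        bins[assign_phase(alleles, phase0, phase1)].append(name)
    return bins


if __name__ == "__main__":
    # Hypothetical haplotypes at three heterozygous sites.
    phase0 = {100: "A", 2500: "C", 7000: "G"}
    phase1 = {100: "G", 2500: "T", 7000: "A"}
    reads = {
        "read1": {100: "A", 2500: "C"},   # matches phase 0 at both sites
        "read2": {2500: "T", 7000: "A"},  # matches phase 1 at both sites
        "read3": {100: "A", 2500: "T"},   # one vote each, left unassigned
    }
    print(bin_reads(reads, phase0, phase1))
```

Long reads help here because a single read often covers several heterozygous sites, which makes the vote decisive across multi-kilobase regions.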

BIN1 GENE: PHASED ALLELES (63 KB)
Heterozygous SNPs can be used to phase alleles across multi-kilobase regions
[Figure tracks: Gene, Probes, Target, Phased SNPs, Phase 0, Phase 1]

MAPT GENE RESULTS FOR PATIENT 1
-Detected a heterozygous deletion
-One allele is transcribed into 21 isoforms and the other into only 5
-Detected a novel exon and transcript
Heterozygous genomic variants can be linked to corresponding expressed transcripts
[Figure: the two alleles, labeled 21 isoforms and 5 isoforms]

ZCWPW1 GENE: RETAINED INTRONS AND NEW EXONS
[Figure panels: Patient 1, Patient 2; annotations: retained intron, novel exon]

CONCLUSION
-AD has a large economic impact on global society (2010: $604B)
-To date, more than 20 putative genetic risk variants have been mapped
-Associated SNPs are usually not the true causative variant
-Combining gDNA and cDNA data is more informative
-Custom IDT xGen® Lockdown® Panels allow flexibility to scale projects
-SMRT Sequencing provides multi-kilobase phased alleles and full-length transcripts
“Structural variants can be more informative for disease diagnostics, prognostics and translation than current SNP mapping and exon sequencing.”
Roses A.D. et al. (2016) Structural variants can be more informative for disease diagnostics, prognostics and translation than current SNP mapping and exon sequencing. Expert Opin Drug Metab Toxicol. 12(2),135-47.
[Image: http://www.mvcenters.com/2015/02/11/dementia-takes-toll-claims-another-american-great-dean-smith/]

ACKNOWLEDGEMENT
Kevin Eng
Ting Hon
Elizabeth Tseng
Aaron Wenger
William Rowell
Jenny Ekholm
Steve Kujawa
Kristina Giorda
Jiashi Wang
Mirna Jarosz
Visit PacBio Blog for new announcements and updates on Targeted Sequencing!
http://www.pacb.com/blog
http://www.pacb.com/applications/targeted-sequencing/
Feel free to contact Jenny Gu (jgu@pacb.com)

Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell, Iso-Seq, and Sequel are trademarks of Pacific Biosciences. BluePippin and SageELF are trademarks of Sage Science. NGS-go and NGSengine are trademarks of GenDx. FEMTO Pulse and Fragment Analyzer are trademarks of Advanced Analytical Technologies. xGen and Lockdown are trademarks of Integrated DNA Technologies, Inc. All other trademarks are the sole property of their respective owners.
www.pacb.com

gDNA Capture
Supplemental Information

PACBIO POLYMERASE READS
Skeletal muscle
Brain

SMRT LINK PROVIDES BASIC PROCESSING OF RAW DATA FOR TARGETED CAPTURE ENRICHMENT STUDIES
SMRT Analysis produces:
-Filtered subreads
-Circular consensus sequences
-Alignment to reference (BAM files)
-Iso-Seq full-length transcripts

BIOINFORMATICS WORKFLOW FOR PHASING ALLELES
Github: Targeted phasing consensus (genomic capture)
[Workflow diagram. Legend: data, SMRT Link, command line tools, third party software]
-Raw data are processed in SMRT Link into subreads and CCS reads
-Aligned BAM file, visualized in IGV 3.0
-Probe *.bed and capture2target.py: defined phase blocks
-samtools (subset and phase): phased alleles per region
-PacBio arrow (command line, polish): phased consensus sequences (*.fasta)
-Consensus accuracy >99.9% (dependent on coverage)
