Representing the human genome
with synthetic spike-in controls.
Tim Mercer
Garvan Institute of Medical Research
Genome In A Bottle - Sept 16th
DISCLAIMER: The Garvan Institute of Medical Research has filed patent
applications on some techniques described in this study.
Figure: the human genome (5’ to 3’) alongside the synthetic reverse genome (3’ to 5’).
Cross-alignment between the two is less than 1% (low-complexity sequences).
Figure: population fraction (log2) of reads by MapQ score (unmapped, MapQ=0, MapQ=1-9, MapQ=10-59, MapQ=60) when aligned to the human genome (5’ to 3’) and the synthetic genome (3’ to 5’), for three libraries:
HUMAN (FWD), simulated (101nt, paired)
SYNTHETIC (REV), simulated (101nt, paired)
NA12878 (Illumina platinum genomes, 101nt, paired)
Figure: read alignments, split-reads, discordant alignments and duplications compared between the human genome (5’ to 3’) and the synthetic genome (3’ to 5’).
NGS reads from the human genome and the mirror genome have the same alignment properties (direction agnostic).
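A minimal sketch of the mirroring idea, assuming that "reverse genome" means the sequence is read in the opposite direction without complementing the bases (an assumption based on the 5’ to 3’ versus 3’ to 5’ labels above). The function name `mirror` and the example sequence are hypothetical.

```python
from collections import Counter


def mirror(seq: str) -> str:
    """Return the sequence reversed, without complementing (assumed meaning of 'mirror')."""
    return seq[::-1]


human = "GATTACAGGCT"          # hypothetical human sequence, written 5' to 3'
synthetic = mirror(human)      # the same bases read 3' to 5'

# Mirroring preserves base composition, and mirroring twice restores the original.
same_composition = Counter(human) == Counter(synthetic)
round_trip_ok = mirror(synthetic) == human
```

Because reversal keeps nucleotide composition and repeat structure, a mirrored sequence would be expected to behave like its human counterpart during alignment while remaining distinguishable from it.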
SPLICED GENES
FUSION GENES
IMMUNE RECEPTORS
GENETIC VARIATION
PRIMER SITES
STRUCTURAL VARIATION
REPEAT DNA
ONE COPY, HALF COPY, HALF COPY
Sequin Manufacture
RNA sequins (left) are produced by in vitro transcription, then undergo size-selection and purification.
DNA sequins (right) are produced by restriction digestion, then undergo size-selection and purification.
Mixtures
Individual (RNA or DNA) sequins are combined to emulate quantitative features (e.g. gene
expression, splicing, allele frequency) and establish internal reference ladders.
Variable Sequins (measure differences between samples): abundance differs between Mixture A and Mixture B, giving a known fold-change.
Constant Sequins (normalise between samples).
Figure: observed abundance against expected abundance for Mix A and Mix B.
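The fold-change between mixtures can be illustrated with a short sketch. The sequin names and concentrations below are hypothetical and are not taken from the actual mixture definitions; the point is only that a sequin present at known amounts in Mix A and Mix B implies a known log2 fold-change, and that a constant sequin has a fold-change of zero.

```python
import math

# Hypothetical concentrations (arbitrary units) of three sequins in each mixture.
mix_a = {"seq_1": 1.0, "seq_2": 4.0, "seq_3": 16.0}
mix_b = {"seq_1": 4.0, "seq_2": 4.0, "seq_3": 1.0}


def expected_log2_fold_change(a: dict, b: dict) -> dict:
    """Log2 ratio of Mix B to Mix A concentration for every sequin in Mix A."""
    return {name: math.log2(b[name] / a[name]) for name in a}


fold_changes = expected_log2_fold_change(mix_a, mix_b)
# seq_2 has the same concentration in both mixtures, so it acts as a constant sequin.
constant = [name for name, fc in fold_changes.items() if fc == 0.0]
```

Observed fold-changes from a differential analysis could then be compared against these expected values.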
Mixture Accuracy
We can measure the variation between five replicates due to:
1) Independent sources (due to mixture prep.) ~0.027sd
2) Dependent sources (sequence specific etc.) ~0.285sd
Figure: normalized sequin abundances in independent mixtures (Mix 1 to Mix 5) against average normalized sequin abundance, with systematic variation and independent variation (pipetting) indicated.
Sequins are added to an RNA/DNA sample at a fractional concentration (typically 2-3%).
The combined sample is then sequenced, with a proportional
fraction of the reads in the final library derived from sequins.
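As a rough worked example of this proportionality, assuming the share of reads simply equals the spike-in fraction (real libraries may deviate), the expected number of sequin reads can be estimated as follows. The function name and the read count used in the example are hypothetical.

```python
def expected_sequin_reads(total_reads: int, spike_fraction: float) -> int:
    """Estimate sequin-derived reads, assuming reads are proportional to the spike-in fraction."""
    if not 0.0 <= spike_fraction <= 1.0:
        raise ValueError("spike_fraction must be between 0 and 1")
    return round(total_reads * spike_fraction)


# Hypothetical library of 400 million reads with a 2% spike-in.
n_sequin = expected_sequin_reads(400_000_000, 0.02)
```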
To distinguish reads in the library that derive from sequins, we align the library to a
combined index comprising both the human genome (hg38) and also the mirror genome.
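One way to picture the partitioning step is to split alignments by the name of the reference sequence each read aligned to. This is only a sketch: the prefix `chrIS` for the mirror (in silico) chromosome is a hypothetical naming convention, and alignments are represented as plain `(read_name, reference_name)` pairs rather than parsed from a real alignment file.

```python
SYNTHETIC_PREFIX = "chrIS"  # hypothetical name prefix for the mirror genome sequences


def partition_alignments(alignments):
    """Split (read_name, reference_name) pairs into human and synthetic groups."""
    parts = {"human": [], "synthetic": []}
    for read_name, reference_name in alignments:
        if reference_name.startswith(SYNTHETIC_PREFIX):
            parts["synthetic"].append(read_name)
        else:
            parts["human"].append(read_name)
    return parts


# Hypothetical alignments against a combined index.
example = [("read1", "chr7"), ("read2", "chrIS"), ("read3", "chr17"), ("read4", "chrIS")]
parts = partition_alignments(example)
```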
Figure: alignments to the human genome and to the synthetic genome (reversed) at BRAF V600E.
Re-reversing the partitioned alignments allows synthetic genome features to be visualised
in the same direction as the human genome.
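The coordinate arithmetic behind re-reversal can be sketched as follows, assuming 0-based half-open intervals and a mirror sequence that is an exact reversal of the same length. The function name is hypothetical; in practice the read sequences would be reversed as well.

```python
from typing import Tuple


def mirror_interval(start: int, end: int, chrom_length: int) -> Tuple[int, int]:
    """Map a 0-based half-open interval [start, end) onto the reversed sequence."""
    if not 0 <= start <= end <= chrom_length:
        raise ValueError("interval must lie within the chromosome")
    # The last base of the original interval becomes the first base of the mirrored one.
    return chrom_length - end, chrom_length - start


# An alignment covering the first 10 bases of a 100-base mirror chromosome
# corresponds to the last 10 bases in the forward orientation.
flipped = mirror_interval(0, 10, 100)
restored = mirror_interval(flipped[0], flipped[1], 100)
```

Applying the mapping twice returns the original interval, which is what makes the reversal lossless for display.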
GENETIC VARIANTS
Figure: sequin pairs (Sequin A, Sequin B) carrying Variant A and Variant B on the in silico chromosome, shown with sequence read coverage, alignments and identified variants for homozygous variation, heterozygous variation and indels.
Manufacture sequins.
Combine with genome DNA for sequencing and analysis.
v1
1kb length
240 SNPs/Indels sampled from dbSNP
(Deveson et al., Nature Methods 2016)
v2 (available shortly)
1.8kb length
99 SNVs/Indels sampled from NA12878 high confidence (v2)
99 SNVs/Indels in difficult regions
(high & low GC, mono/di/tri nucleotide repeats)
heterozygous / homozygous as per annotation
Example Workflow
Figure: sample and sequins pass through library preparation, sequencing, alignment (reference genome and synthetic genome), analysis and results.
Sequins are added (~2%) to the NA12878 genome DNA sample prior to
library preparation.
The library then undergoes sequencing (125nt
paired-end Illumina to ~40x
coverage).
Calibrated Coverage
Subsample sequin alignments to precisely calibrate sequin coverage (left, blue)
to the matched regions in the accompanying human genome (right, red).
Figure: per-base coverage against length (percentile) for sequins and for the human genome (NA12878), with edge effects marked at both ends.
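A minimal sketch of coverage calibration by subsampling, assuming each sequin alignment is kept independently with probability equal to the ratio of human to sequin coverage. The function names, coverage values and seed are hypothetical, and this is not necessarily how any particular tool implements calibration.

```python
import random


def subsample_fraction(sequin_coverage: float, human_coverage: float) -> float:
    """Fraction of sequin alignments to keep so that coverage matches the human regions."""
    if sequin_coverage <= 0:
        raise ValueError("sequin_coverage must be positive")
    # Never keep more than everything: subsampling cannot raise coverage.
    return min(1.0, human_coverage / sequin_coverage)


def subsample(alignments, fraction: float, seed: int = 0):
    """Keep each alignment independently with probability `fraction`."""
    rng = random.Random(seed)
    return [a for a in alignments if rng.random() < fraction]


# Hypothetical case: sequins at 160x coverage, matched human regions at 40x.
fraction = subsample_fraction(160.0, 40.0)
kept = subsample(list(range(10_000)), fraction)
```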
Germline Variation
Synthetic heterozygous variants are detected comparably to human variation (using the
NA12878 reference annotations) across a range of sequence coverage (1-50x).
Figure: sensitivity (%) against median per-base coverage for single nucleotide variation (SNV) and insertion/deletion (indel), comparing sequins with NA12878 (sequin SNV, sequin deletion, NA12878 SNV, NA12878 deletion).
Somatic Mutations
By titrating ‘variant’ sequins relative to ‘reference’ sequins, we can establish the range
of somatic mutation frequencies observed in complex tumour sub-populations.
Figure: sequin frequencies, labelled FOXP1/FLT3, FLT3/IDH1, CXCL17, TP53 and IDH2/RUNX1 (Griffith et al., Cell Systems 2015).
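The titration can be expressed as simple arithmetic, assuming the expected allele frequency is the variant sequin's share of the combined variant and reference amount. The function name and the amounts below are hypothetical.

```python
def expected_allele_frequency(variant_amount: float, reference_amount: float) -> float:
    """Variant share of the total, assuming both sequins are sampled proportionally."""
    total = variant_amount + reference_amount
    if total <= 0:
        raise ValueError("total amount must be positive")
    return variant_amount / total


# Equal amounts emulate a heterozygous site; diluting the variant emulates a sub-clonal mutation.
heterozygous = expected_allele_frequency(1.0, 1.0)
subclonal = expected_allele_frequency(1.0, 99.0)
```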
Sensitivity
Assess quantitative accuracy of measuring allele frequency with NGS:
1) Limit of Quantification indicates the minimum allele frequency required for accurate
quantification.
2) Correlation and slope describe quantitative accuracy and biases of the NGS assay.
Figure: observed allele frequency (log2) against expected allele frequency (log2), with the limit of quantification and the heterozygous frequency marked.
Intercept: -0.0619612
Slope: 1.08278
R2: 0.943421
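Slope, intercept and R2 of this kind can be obtained from an ordinary least-squares fit on log2-transformed frequencies. The sketch below uses only the standard library and hypothetical data; it illustrates the statistic, not the exact procedure used to produce the values on the slide.

```python
import math


def fit_log2_line(expected, observed):
    """Least-squares fit of log2(observed) on log2(expected); returns (slope, intercept, r2)."""
    x = [math.log2(v) for v in expected]
    y = [math.log2(v) for v in observed]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


# Hypothetical ladder in which every observed frequency is half the expected one:
# the slope stays at 1 and the bias shows up as an intercept of -1 on the log2 scale.
expected = [0.5, 0.25, 0.125, 0.0625]
observed = [v / 2 for v in expected]
slope, intercept, r2 = fit_log2_line(expected, observed)
```

A slope near 1 with an intercept near 0 would indicate that observed frequencies track the expected ladder without systematic bias.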
Precision
Detection of false positive variants (from sequencing error, misalignments etc.) in
sequins enables an estimate of specificity (precision).
Figure: sequence read coverage and alignments over heterozygous variation in Sequin A and Sequin B, distinguishing a true positive from a false positive (sequencing or alignment error?).
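Because every true variant in the sequins is known in advance, called variants can be scored directly. The sketch below assumes variants are represented as hypothetical `(position, ref, alt)` tuples and computes sensitivity and precision from set overlaps.

```python
def score_calls(known: set, called: set) -> dict:
    """Compare called variants with the known sequin variants."""
    tp = len(known & called)   # known variants that were called
    fp = len(called - known)   # calls with no matching known variant
    fn = len(known - called)   # known variants that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Hypothetical example: four known sequin variants, four calls, one of them spurious.
known = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A"), (730, "T", "C")}
called = {(100, "A", "G"), (250, "C", "T"), (400, "G", "A"), (555, "A", "T")}
result = score_calls(known, called)
```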
Sequins are a simple and effective method to measure the diagnostic power of an NGS library.
Figure (Test Precision): cumulative fraction against expected allele frequency (log2), showing sensitivity (true-positive) and precision (false-positive).
Figure (Test Diagnostic): true positive rate against false-positive rate, with one curve per allele frequency.
Allele Frequency    AUC value
1:1                 1.0000
1:2                 1.0000
1:4                 0.9962
1:8                 0.9934
1:16                1.0000
1:32                0.9968
1:64                1.0000
1:128               0.9834
1:256               0.9382
1:512               0.6408
1:1,024             0.5495
1:2,048             0.1683
1:4,096             0.1173
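An AUC value summarises how well a score separates true-positive from false-positive calls. The sketch below computes it with the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen true positive outscores a randomly chosen false positive, counting ties as half. The scores are hypothetical (for example, variant quality values), and this is not necessarily the method used to produce the AUC values listed above.

```python
def auc(true_positive_scores, false_positive_scores) -> float:
    """Probability that a true positive outscores a false positive (ties count as 0.5)."""
    if not true_positive_scores or not false_positive_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for tp in true_positive_scores:
        for fp in false_positive_scores:
            if tp > fp:
                wins += 1.0
            elif tp == fp:
                wins += 0.5
    return wins / (len(true_positive_scores) * len(false_positive_scores))


# Hypothetical scores: perfect separation, then partial overlap.
perfect = auc([0.9, 0.8], [0.1, 0.2])
overlapping = auc([0.9, 0.3], [0.5, 0.1])
```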
ANAQUIN - SEQUIN ANALYSIS TOOLKIT
Anaquin is a software toolkit for the analysis of sequins that integrates with NGS
analytical pipelines and supports standard formats and common bioinformatic tools.
LABORATORY PROTOCOL
User’s RNA Sample
RNA Sequin Controls (Spike In)
Combined Sample (with 2-3% sequins)
Library Preparation (polyA, ribo-depletion etc.)
Next-Generation Sequencing (.FASTQ)
RNA-SEQ BIOINFORMATICS PIPELINE
Alignment (e.g. BWA, BowTie2, Tophat2, STAR) to the Human Genome (hg38) and the In Silico Chromosome
Gene Assembly (e.g. Cufflinks, StringTie)
Normalisation
Gene Expression (e.g. Cufflinks, Kallisto, DESeq2, edgeR)
File formats: .BAM, .SAM; .BAM*, .SAM*; .VCF, .TXT; .GTF, .TXT
ANAQUIN in C++
RnaAlign (Alignment performance)
RnaSubsample (Calibration of Multiple Samples)
RnaAssembly (Isoform Assembly)
RnaExpression (Gene, Isoform and Exon Expression)
RnaFoldChange (Differential Gene Expression)
ANAQUIN in R
plotLinear (Gene Expression)
plotLOD (Fold-change sensitivity)
plotROC (Fold-change sensitivity)
plotLogistic (Isoform Assembly)
Outputs: Diagnostic Statistics, Inter-Sample Normalisation, Reference Ladders, Output Report, Assess Performance
SEQUINS ARE FREE FOR NON-PROFIT RESEARCH,
request an aliquot from www.sequin.xyz
Acknowledgments:
Ted Wong
Jim Blackburn
Ira Deveson
Bindu Kanakamedala
Simon Hardwick
Wendy Chen
James Ferguson
John Mattick
Katrina Frankcombe
Peter Whitfield
Further Reading
‘Representing genetic variation with synthetic DNA standards.’
by Deveson et al., (2016) Nature Methods
‘Spliced synthetic genes as internal controls in RNA sequencing experiments’
by Hardwick et al., (2016) Nature Methods

Sept2016 plenary mercer_sequins
