Ion Torrent Sequencing Applications: Variant Calling, Barcoding, and Long Range Mate Pairs

David Jenkins
Bioinformatics Engineer
EdgeBio

Contract Research Division
• Five SOLiD4 sequencing platforms
• One Life Technologies 5500XL
• Two Ion Torrent PGMs
• Automation through Caliper Sciclone & Biomek FX
• Life Technologies Preferred Service Provider
• Agilent Certified Service Provider
• Commercial partnerships with companies such as CLCBio, DNANexus and Genologics
• MD/PhD & Masters Level Scientists and Bioinformaticians
• IT Infrastructure of >100 CPUs and >100TB storage

Agenda
• Germ Line Variant Caller
• Barcoding
• Long Range Mate Pair Data

Variant Calling
• Goal: identify SNPs and INDELs
– High sensitivity
• Few false negatives
– High positive predictive value
• Few false positives
• Challenge: distinguish between homopolymer sequencing errors and true INDELs

Variant Calling
• DH10B
• All identified variants are false positives
• PPV and sensitivity
• maq fakemut used to insert artificial mutations
– 220 SNPs and 239 INDELs
• EdgeBio 316 Chip Run
– 11.00x AQ17 coverage of genome
• Goal: identify the most sensitive (true pos./[true pos. + false neg.]) settings that don’t lose PPV (true pos./[true pos. + false pos.])
– Identify the most variants while avoiding calling any non-variants
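
The two ratios defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names and the example counts (400 recovered variants, 25 spurious calls) are hypothetical. Only the total of 459 inserted mutations (220 SNPs + 239 INDELs) comes from the slide.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real variants that were called: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def ppv(true_pos: int, false_pos: int) -> float:
    """Fraction of calls that are real variants: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


# 220 SNPs + 239 INDELs were inserted with maq fakemut (from the slide).
inserted_variants = 220 + 239

# Hypothetical caller output, for illustration only.
true_pos = 400                             # inserted variants that were called
false_neg = inserted_variants - true_pos   # inserted variants that were missed
false_pos = 25                             # calls that match no inserted variant

print(f"sensitivity = {sensitivity(true_pos, false_neg):.3%}")
print(f"PPV         = {ppv(true_pos, false_pos):.3%}")
```

Raising the stringency of a caller typically lowers the false positive count (higher PPV) at the cost of more false negatives (lower sensitivity), which is the trade-off the settings comparison explores.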

Samtools Defaults vs. Variant Calling Settings

• Default samtools settings are not optimized for the Ion Torrent error model
– Lower base quality of candidates
– Coverage from both strands
– Strict requirements for homopolymers
• two sequences from both strands

PPV and Corrected Sensitivity by Settings

Settings                       PPV                             Corrected Sensitivity
                               Total     SNPs      INDELs      Total     SNPs      INDELs
Samtools Default Settings      6.014%    96.682%   3.203%      100%      100%      100%
Q4, h100, o20, e27, m1, H1     39.672%   100%      25.060%     98.690%   99.550%   97.910%
Q14, h100, o20, e21, m1, H2    79.565%   100%      64.259%     92.810%   98.180%   89.870%
Q7, h50, o10, e17, m4, H1      93.523%   100%      86.486%     91.720%   99.090%   84.940%
Q14, h50, o10, e17, m4, H1     95.148%   100%      89.655%     90.850%   98.180%   84.100%
Variant Calling Settings       95.676%   100%      90.533%     90.650%   99.550%   83.260%
Q14, h50, o10, e17, m4, H2     97.175%   100%      93.631%     89.540%   96.360%   83.260%
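
To make the trade-off in the table concrete, the sketch below computes, for each settings row, how many percentage points of total PPV are gained and how many points of total corrected sensitivity are given up relative to the samtools defaults. The numbers are copied from the table; the dictionary layout and the `tradeoff` helper are illustrative assumptions, not part of the original analysis.

```python
# (total PPV %, total corrected sensitivity %) per settings row, from the table.
results = {
    "Samtools Default Settings": (6.014, 100.0),
    "Q4, h100, o20, e27, m1, H1": (39.672, 98.690),
    "Q14, h100, o20, e21, m1, H2": (79.565, 92.810),
    "Q7, h50, o10, e17, m4, H1": (93.523, 91.720),
    "Q14, h50, o10, e17, m4, H1": (95.148, 90.850),
    "Variant Calling Settings": (95.676, 90.650),
    "Q14, h50, o10, e17, m4, H2": (97.175, 89.540),
}

BASELINE = "Samtools Default Settings"


def tradeoff(name: str, baseline: str = BASELINE) -> tuple[float, float]:
    """Return (PPV points gained, sensitivity points lost) versus the baseline."""
    ppv, sens = results[name]
    base_ppv, base_sens = results[baseline]
    return ppv - base_ppv, base_sens - sens


for name in results:
    gained, lost = tradeoff(name)
    print(f"{name:30s} PPV +{gained:7.3f} pts   sensitivity -{lost:6.3f} pts")
```

Read this way, each step down the table buys a large gain in PPV for a comparatively small loss in sensitivity.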

[Figure: PPV and Sensitivity of Samtools Analyses. Bar chart (0% to 100%) of Total, SNPs, and INDELs PPV and Total, SNPs, and INDELs Corrected Sensitivity for each settings combination, from Default Samtools through the Variant Calling settings.]

Similar Results with Public DH10B Runs
[Figure: PPV and Sensitivity of Public DH10B Runs. Bar chart (0% to 100%) of Total, SNP, and INDEL PPV and Total, SNP, and INDEL Sensitivity for public Life Ion Torrent DH10B runs (314, 316, and 318 chips, including 314LR and 316LR long-read runs) and the Edge Bio Ion Torrent 316 DH10B run. Callout: >99% accuracy.]

Homopolymer Mutated Reference Genome
[Figure: Homopolymer PPV and Sensitivity. Bar chart (0% to 100%) of Homopolymer PPV and Homopolymer Sensitivity for the same public Life Ion Torrent DH10B runs (314, 316, and 318 chips, including 314LR and 316LR long-read runs) and the Edge Bio Ion Torrent 316 DH10B run. Callout: >99% per base accuracy with long reads.]

Conclusions
• Variant Calling plugin able to identify >80% of well-covered INDELs and >99% of well-covered SNPs
• Improves on performance of default samtools settings by avoiding false positive SNPs and INDELs
• Important to remember Variant Calling is Application Specific
• Easy to re-run Germ Line Variant Caller with custom settings
• More information at http://www.edgebio.com/blog/

Barcoding
• HuRef gDNA
• Compared read quality statistics with non-barcoded run
• IonSet barcodes 5-8
• 11bp barcodes at beginning of the read

Barcoding
• 94.51% of reads mapped to the barcodes used
• Variant Calling Report for Each Barcode
– New feature in 1.5.1
• Ion Community Feature Requests
– Aligning barcodes to different references
– Find out what community wants

Quality Comparison
[Figure: read quality plots for two runs. Panels: Barcoded hg19 Run (TS 1.5.1); Non-barcoded hg19 Run (TS 1.5.1).]

Conclusions
• Similar quality between barcoded and non-barcoded runs
• Robust set of barcodes
• Losing first 11 high quality bases to the barcode
– Explains lower initial quality
• 318 Chip and Barcoding
• Ion Torrent Community
– Technical details
– Desired Features
– Troubleshooting

Long Range Mate Pairs
• Data provided by Ion Torrent
• Average 10KB inserts
• Split sff files with sff_extract utility
• >IA_A
• CTGCTGTACGGCCAAGGCGGATGTACGGTACAGCAG
• >IA_B
• CTGCTGTACCGTACATCCGCCTTGGCCGTACAGCAG
• Can reads map successfully with average 10KB inserts?
– Increasing homopolymers farther into read
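
As a rough illustration of what splitting a mate-pair read on its linker involves, here is a minimal Python sketch. It is not the sff_extract algorithm: it works on plain sequence strings rather than sff files, uses exact matching only (so partial or error-containing linkers are not detected), and the `min_len` cutoff and category names are hypothetical. The two linker sequences are taken from the slide; note that IA_B is the reverse complement of IA_A.

```python
IA_A = "CTGCTGTACGGCCAAGGCGGATGTACGGTACAGCAG"
IA_B = "CTGCTGTACCGTACATCCGCCTTGGCCGTACAGCAG"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_on_linker(read: str, linkers=(IA_A, IA_B), min_len: int = 20):
    """Classify a read and, when exactly one linker is found, split it in two.

    Returns (category, parts). Exact matching only; min_len is a
    hypothetical cutoff chosen for this example.
    """
    occurrences = sum(read.count(linker) for linker in linkers)
    if occurrences == 0:
        return "no_linker", (read,)
    if occurrences > 1:
        return "multiple_linkers", ()
    linker = next(l for l in linkers if l in read)
    pos = read.find(linker)
    left, right = read[:pos], read[pos + len(linker):]
    if len(left) < min_len or len(right) < min_len:
        return "too_short", ()
    return "split", (left, right)


# Usage: a made-up read with 32 bases on each side of the IA_A linker.
left_mate = "ACGTTGCA" * 4
right_mate = "GGATCCAT" * 4
print(split_on_linker(left_mate + IA_A + right_mate))
```

A production splitter would also need to tolerate sequencing errors inside the linker, which matters here because homopolymer errors increase farther into the read.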

Unsplit Reads
Metric                   Value
Total Number of Bases    404.65 Mbp
Q17 Bases                207.67 Mbp
Q20 Bases                150.07 Mbp
Total Number of Reads    2,308,396
Mean Length              175 bp
Longest Read             365 bp

From: http://flxlexblog.wordpress.com

Split Reads Metrics
Type               Count        Percent
Total Reads        2,308,396
Orphan Reads       220,707      9.561%
Partial Linker     106,913      4.631%
Multiple Linker    29           0.001%
Too Short          1,757        0.076%
Correctly Split    1,978,990    85.730%

[Figure: bar chart of read counts (0 to 2,000,000) per category: Orphan Reads (1 seq), Partial Linker Found, Multiple Linker Occurrences, Too Short, Correctly Split Reads.]
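
The percentages in the split-read metrics follow directly from the counts. The short sketch below recomputes them and checks that the five categories account for every read; the counts are from the slide, while the variable and function names are illustrative.

```python
total_reads = 2_308_396

# Read counts per category, as reported on the slide.
categories = {
    "Orphan Reads": 220_707,
    "Partial Linker": 106_913,
    "Multiple Linker": 29,
    "Too Short": 1_757,
    "Correctly Split": 1_978_990,
}


def percent(count: int, total: int) -> float:
    """Share of the total as a percentage, rounded to three decimals."""
    return round(100 * count / total, 3)


# The categories are mutually exclusive and cover all reads.
unaccounted = total_reads - sum(categories.values())
print(f"reads not assigned to a category: {unaccounted}")

for name, count in categories.items():
    print(f"{name:16s} {count:>10,d}  {percent(count, total_reads):7.3f}%")
```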

Reads 1
• Per base sequence quality below Q20 after base 20
• Analysis performed pre TS 1.5 release
• Predicted base quality has improved
• Homopolymer enrichment relatively consistent across the read

Reads 2
• Per base sequence quality below Q20
• Second part of read in lower quality region of unsplit read
• Homopolymer enrichment still fairly uniform

Insert Size
[Figure: insert size distributions for the two aligners.]
• bwa: μ = 10189.78, σ = 1282.43
• tmap: μ = 9751.20, σ = 2016.62

Mapping
Metric                           AQ17      AQ20      Perfect
Total Number of Bases [Mbp]      218.55    179.37    170.28
Mean Length [bp]                 70        63        60
Longest Alignment [bp]           173       171       167
Mean Coverage Depth              46.6x     38.3x     36.3x
Percentage of Library Covered    99.99%    99.99%    99.99%
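
Mean coverage depth is the number of aligned bases divided by the reference length. The sketch below reproduces the depths in the mapping table under one assumption that the slides do not state: a reference of about 4.69 Mbp, roughly the size of the E. coli DH10B genome. The constant and function names are illustrative.

```python
# Assumption (not stated on the slides): reference length of about 4.69 Mbp,
# roughly the size of the E. coli DH10B genome.
ASSUMED_REFERENCE_LENGTH_BP = 4_686_137


def mean_coverage_depth(aligned_mbp: float, reference_bp: int) -> float:
    """Average number of aligned bases covering each reference position."""
    return aligned_mbp * 1e6 / reference_bp


# Aligned bases in Mbp at each quality level, from the mapping table.
aligned_bases_mbp = {"AQ17": 218.55, "AQ20": 179.37, "Perfect": 170.28}

for level, mbp in aligned_bases_mbp.items():
    depth = mean_coverage_depth(mbp, ASSUMED_REFERENCE_LENGTH_BP)
    print(f"{level:8s} {depth:.1f}x")
```

Under this assumption the computed depths agree with the 46.6x, 38.3x, and 36.3x reported in the table.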

Read Length [bp]    Reads        Unmapped    Excluded    Clipped    Perfect      1 mismatch    >= 2 mismatches
50                  3,240,310    15,981      20          0          1,810,959    744,229       669,121
100                 349,925      1,340       5           49,717     104,928      72,110        121,825
150                 1,944        73          0           851        127          172           721

Conclusion
• Long reads capable of producing Mate Pair reads
– Quality mapping
– Tight distribution around insert size
• Human Application
– With longer insert sizes (40kb) could be used to resolve structural variation
• Blog post coming soon:
– http://www.edgebio.com/blog/

Thanks
Edge Bio Team
• Lab
– Joy Adigun
– Jennifer Sheffield
– Ryan Mease
– Rossio Kersey
• Informatics
– Anju Varadarajan
– Phil Dagosto
• Justin Johnson
• John Seed
• Dean Galaas

Follow Us:
• EdgeBio Twitter: @EdgeBio
• David Jenkins Twitter: @dfjenkins3
• Justin Johnson Twitter: @BioInfo
• djenkins@edgebio.com
• http://www.edgebio.com/blog/
