EdgeBio discusses three applications for Ion Torrent sequencing that we have been exploring lately. We discuss the robustness of the included Germ Line Variant Caller, the barcoding capability on the Ion Torrent, and a new dataset of Long Range (10kb) Mate Pairs.
Ion Torrent Sequencing Applications: Variant Calling, Barcoding, and Long Range Mate Pairs
1. Ion Torrent Sequencing Applications: Variant Calling, Barcoding, and Long Range Mate Pairs
David Jenkins
Bioinformatics Engineer
EdgeBio
2. Contract Research Division
• Five SOLiD4 sequencing platforms
• One Life Technologies 5500XL
• Two Ion Torrent PGMs
• Automation thru Caliper Sciclone & Biomek FX
• Life Technologies Preferred Service Provider
• Agilent Certified Service Provider
• Commercial partnerships with companies such as CLCBio,
DNANexus and Genologics
• MD/PhD & Masters Level Scientists and Bioinformaticians
• IT Infrastructure of >100 CPUs and >100TB storage
3. Agenda
• Germ Line Variant Caller
• Barcoding
• Long Range Mate Pair Data
5. Variant Calling
• Goal: identify SNPs
and INDELs
– High sensitivity
• Few false negatives
– High positive predictive
value
• Few false positives
• Challenge: distinguish
between homopolymer
sequencing error and
true INDELs
6. Variant Calling
• DH10B
• All identified variants are false positives
• PPV and sensitivity
• maq fakemut used to insert artificial mutations
– 220 SNPs and 239 INDELs
• EdgeBio 316 Chip Run
– 11.00x AQ17 coverage of genome
• Goal: identify most sensitive (true pos./[true pos. +
false neg.]) settings that don’t lose PPV (true pos./[true
pos. + false pos.])
– Identify the most variants while avoiding calling any non-
variants
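The sensitivity and PPV definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used for the talk: the function names and the toy variant tuples are made up for the example, and variants are assumed to be comparable as exact (contig, position, ref, alt) tuples.

```python
def classify_calls(inserted, called):
    """Count true positives, false positives and false negatives.

    `inserted` holds the artificial variants placed in the reference,
    `called` holds what the variant caller reported. Both are iterables
    of hashable variant descriptions (exact-match comparison is an
    assumption of this sketch).
    """
    inserted, called = set(inserted), set(called)
    true_pos = len(inserted & called)    # inserted and found
    false_pos = len(called - inserted)   # found but never inserted
    false_neg = len(inserted - called)   # inserted but missed
    return true_pos, false_pos, false_neg


def sensitivity(true_pos, false_neg):
    """true pos. / (true pos. + false neg.)"""
    denom = true_pos + false_neg
    return true_pos / denom if denom else 0.0


def ppv(true_pos, false_pos):
    """true pos. / (true pos. + false pos.)"""
    denom = true_pos + false_pos
    return true_pos / denom if denom else 0.0


# Toy data, invented for illustration only.
inserted = {("chr", 100, "A", "G"), ("chr", 250, "C", "CT"),
            ("chr", 400, "G", "T"), ("chr", 900, "TA", "T")}
called = {("chr", 100, "A", "G"), ("chr", 250, "C", "CT"),
          ("chr", 400, "G", "T"), ("chr", 777, "A", "C")}

tp, fp, fn = classify_calls(inserted, called)
print(f"sensitivity={sensitivity(tp, fn):.2%}  PPV={ppv(tp, fp):.2%}")
```

In the toy data one inserted deletion is missed and one call was never inserted, so both metrics drop below 100%.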
7. Samtools Defaults vs. Variant Calling
Settings
• Default samtools setting not optimized for Ion
Torrent error model
– Lower base quality of candidates
– Coverage from both strands
– Strict requirements for homopolymers
• At least two reads covering the INDEL from both strands
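A strand-support rule like the one described here might look as follows. This is a hypothetical sketch, not the plugin's actual implementation: the `IndelCandidate` class, its field names, and the reading of the homopolymer rule as "at least two reads on each strand" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class IndelCandidate:
    """Hypothetical summary of the reads supporting one candidate INDEL."""
    pos: int
    fwd_reads: int        # supporting reads on the forward strand
    rev_reads: int        # supporting reads on the reverse strand
    in_homopolymer: bool  # True if the INDEL falls in a homopolymer run


def passes_strand_filter(cand, min_per_strand_hp=2):
    """Keep a candidate only if it has support from both strands.

    Homopolymer INDELs get the stricter threshold because, on this
    platform, they are more likely to be sequencing errors.
    The threshold of 2 per strand is an assumed interpretation.
    """
    if cand.fwd_reads < 1 or cand.rev_reads < 1:
        return False  # support from only one strand
    if cand.in_homopolymer:
        return (cand.fwd_reads >= min_per_strand_hp
                and cand.rev_reads >= min_per_strand_hp)
    return True


candidates = [
    IndelCandidate(pos=1200, fwd_reads=5, rev_reads=0, in_homopolymer=False),
    IndelCandidate(pos=3400, fwd_reads=3, rev_reads=1, in_homopolymer=True),
    IndelCandidate(pos=5600, fwd_reads=2, rev_reads=2, in_homopolymer=True),
]
kept = [c.pos for c in candidates if passes_strand_filter(c)]
print(kept)
```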
10. Similar Results with Public DH10B Runs
[Figure: PPV and Sensitivity of Public DH10B Runs. Bar chart, y-axis 0.00% to 100.00%. Series: Total PPV, SNP PPV, INDEL PPV, Total Sensitivity, SNP Sensitivity, INDEL Sensitivity. Runs: Life Ion Torrent 314 100MB; Life Ion Torrent 316LR DH10B; Life Ion Torrent 318 Chip; Life Ion Torrent 314LR DH10B; Edge Bio Ion Torrent 316 DH10B; Life Ion Torrent 316 DH10B; Life Ion Torrent 316LR DH10B >99% accuracy]
11. Homopolymer Mutated Reference Genome
[Figure: Homopolymer PPV and Sensitivity. Bar chart, y-axis 0.00% to 100.00%. Series: Homopolymer PPV, Homopolymer Sensitivity. Runs: Life Ion Torrent 314 100MB; Life Ion Torrent 316LR DH10B; Life Ion Torrent 318 Chip; Life Ion Torrent 314LR DH10B; Edge Bio Ion Torrent 316 DH10B; Life Ion Torrent 316 DH10B; Life Ion Torrent 316LR DH10B >99% per base accuracy with long reads]
12. Conclusions
• Variant Calling plugin able to identify >80% well-covered INDELs and >99% well-covered SNPs
• Improves on performance of default samtools settings by avoiding false positive SNPs and INDELs
• Important to remember Variant Calling is Application Specific
• Easy to re-run Germ Line Variant Caller with custom settings.
• More information at http://www.edgebio.com/blog/
14. Barcoding
• HuRef gDNA
• Compared read quality statistics with non-
barcoded run
• IonSet barcodes 5-8
• 11bp barcodes at beginning of the read
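Assigning a read to a barcode at the start of the read and trimming it off can be sketched as below. This is only an illustration under stated assumptions: the two barcode sequences are invented placeholders (not the real IonSet sequences), the sample names are hypothetical, and matching is exact, whereas a production demultiplexer would typically tolerate some mismatches.

```python
BARCODE_LEN = 11  # barcode length stated on the slide

# Placeholder sequences, made up for this example.
BARCODES = {
    "ACGTACGTACG": "example_barcode_1",
    "TTGACCATGCA": "example_barcode_2",
}


def demultiplex(read):
    """Return (barcode_name, trimmed_read).

    The first BARCODE_LEN bases are compared against the known barcodes.
    Reads with no exact match are returned untrimmed with name None.
    """
    name = BARCODES.get(read[:BARCODE_LEN])
    if name is None:
        return None, read
    return name, read[BARCODE_LEN:]


reads = [
    "ACGTACGTACG" + "GGATCCTTAGC",
    "TTGACCATGCA" + "CCCTGAAAGTT",
    "GGGGGGGGGGG" + "ATATATATATA",
]
assigned = [demultiplex(r) for r in reads]
matched = sum(1 for name, _ in assigned if name is not None)
print(f"{matched}/{len(reads)} reads matched a barcode")
```

Because the trimmed bases come from the very start of the read, every assigned read is 11 bases shorter than the raw read.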
15. Barcoding
• 94.51% reads mapped
to barcodes used.
• Variant Calling Report
for Each Barcode
– New feature in 1.5.1
• Ion Community Feature
Requests
– Aligning barcodes to
different references
– Find out what
community wants
18. Conclusions
• Similar quality between barcoded and non barcoded runs
• Robust set of barcodes
• Losing first 11 high quality bases to the barcode
– Explains lower initial quality
• 318 Chip and Barcoding
• Ion Torrent Community
– Technical details
– Desired Features
– Troubleshooting
20. Long Range Mate Pairs
• Data provided by Ion Torrent
• Average 10KB inserts
• Split sff files with sff_extract utility
• >IA_A
• CTGCTGTACGGCCAAGGCGGATGTACGGTACAGCAG
• >IA_B
• CTGCTGTACCGTACATCCGCCTTGGCCGTACAGCAG
• Can reads map successfully with average 10KB
inserts?
– Increasing homopolymers farther into read
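The idea of splitting a read at the internal linker can be illustrated with a short Python sketch. It is not how sff_extract works internally: it uses exact string matching on the two linker sequences from the slide, does not detect partial linkers, and the `MIN_MATE_LEN` threshold and category names are assumptions chosen for the example.

```python
IA_A = "CTGCTGTACGGCCAAGGCGGATGTACGGTACAGCAG"
IA_B = "CTGCTGTACCGTACATCCGCCTTGGCCGTACAGCAG"
MIN_MATE_LEN = 20  # assumed minimum usable mate length


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def split_mate_pair(read, linkers=(IA_A, IA_B), min_len=MIN_MATE_LEN):
    """Classify a read and split it at the linker if exactly one is found.

    Returns (category, sequences). Exact matching only, so reads with a
    partial or error-containing linker are reported as "orphan" here.
    """
    total_hits = sum(read.count(linker) for linker in linkers)
    if total_hits == 0:
        return "orphan", (read,)
    if total_hits > 1:
        return "multiple_linker", (read,)
    linker = next(l for l in linkers if l in read)
    start = read.find(linker)
    left, right = read[:start], read[start + len(linker):]
    if len(left) < min_len or len(right) < min_len:
        return "too_short", (left, right)
    return "split", (left, right)


left_mate = "ACGT" * 6    # 24 bases, toy sequence
right_mate = "TTGCA" * 5  # 25 bases, toy sequence
category, mates = split_mate_pair(left_mate + IA_A + right_mate)
print(category, [len(m) for m in mates])
```

The two linkers on the slide are reverse complements of each other, which `reverse_complement(IA_A) == IA_B` confirms.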
21. Unsplit Reads
Metric                   Value
Total Number of Bases    404.65 Mbp
Q17 Bases                207.67 Mbp
Q20 Bases                150.07 Mbp
Total Number of Reads    2,308,396
Mean Length              175 bp
Longest Read             365 bp
From: http://flxlexblog.wordpress.com
22. Split Reads Metrics
Type              Count       Percent
Total Reads       2,308,396
Orphan Reads      220,707     9.561%
Partial Linker    106,913     4.631%
Multiple Linker   29          0.001%
Too Short         1,757       0.076%
Correctly Split   1,978,990   85.730%
[Figure: bar chart of read counts (y-axis 0 to 2,000,000) by category: Orphan Reads (1 seq), Partial Linker Found, Multiple Linker Occurrences, Too Short, Correctly Split Reads]
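As a quick check on the split-read metrics, the percentages can be recomputed from the counts reported on this slide. The snippet below is just that arithmetic; the dictionary keys are the category names from the slide.

```python
TOTAL_READS = 2_308_396

counts = {
    "Orphan Reads": 220_707,
    "Partial Linker": 106_913,
    "Multiple Linker": 29,
    "Too Short": 1_757,
    "Correctly Split": 1_978_990,
}

# Each category as a percentage of all reads, rounded to 3 decimals.
percents = {name: round(100 * n / TOTAL_READS, 3) for name, n in counts.items()}

for name, pct in percents.items():
    print(f"{name:<16} {counts[name]:>10,} {pct:>8.3f}%")
```

The five categories add up to exactly the total read count, so every read falls into one of them.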
23. Reads 1
• Per base sequence quality below Q20 after base 20
• Analysis performed pre TS 1.5 release
• Predicted base quality has improved
• Homopolymer enrichment relatively consistent across the read
24. Reads 2
• Per base sequence quality below Q20
• Second part of read in lower quality region of unsplit read
• Homopolymer enrichment still fairly uniform
27. Conclusion
• Long reads capable of producing Mate Pair
reads
– Quality mapping
– Tight distribution around insert size
• Human Application
– With longer insert sizes (40kb) could be used to
resolve structural variation
• Blog post coming soon:
– http://www.edgebio.com/blog/
28. Thanks
Edge Bio Team
• Lab
– Joy Adigun
– Jennifer Sheffield
– Ryan Mease
– Rossio Kersey
• Informatics
– Anju Varadarajan
– Phil Dagosto
• Justin Johnson
• John Seed
• Dean Galaas
Follow Us:
• EdgeBio Twitter: @EdgeBio
• David Jenkins Twitter: @dfjenkins3
• Justin Johnson Twitter: @BioInfo
• djenkins@edgebio.com
• http://www.edgebio.com/blog/
Editor's Notes
Introduction: I work on Ion Torrent data and try to stay on top of all newly released public Ion Torrent datasets. I write blog posts for the EdgeBio website, and we’ve recently posted about the Germ Line Variant Caller in the Ion Torrent pipeline.
With new high-throughput sequencing techniques, lower costs, and faster turnaround times, we see broader applications for high-throughput sequencing data. We aim to provide sequencing as a service to allow anyone access to quality data and analysis in a broad range of applications. The foundation of our service is Ion Torrent. I am going to talk about three applications of Ion Torrent sequencing data.
Talk a bit about three applications that we have been analyzing at Edge and why we think they’re important.
Why is this important? Identification of variants is one of the foundations of resequencing projects, from amplicons to whole genome sequencing. We want to use Ion Torrent to quickly and accurately identify variants in our data sets. This is a challenging problem for each sequencing platform, for different reasons. We are looking for the ‘best’ solution for variant calling.
The variant caller is a relatively recent and welcome addition to the Ion Torrent pipeline. Goal: identify SNPs and INDELs with high sensitivity (find all of the SNPs that are actually there) and high positive predictive value (avoid calling any false positive SNPs). The slide shows a picture of a SNP seen with samtools tview. The line at the top is the reference sequence, and all of the dots and commas are agreement with the reference. At one position the reference sequence is a T but the base mapping to that position is a ‘C’. Ion Torrent does struggle with homopolymer sequencing error, and the challenge for the variant calling plugin is to distinguish between homopolymer sequencing errors and true INDELs in the sample. How can we test a variant caller’s ability to identify true variants and avoid false variant calls?
Use DH10B. It is a well-validated sequence, so any variants that your program identifies during variant calling are false positives. Fine-tune your settings until you minimize the number of false positive variants identified. What about identifying all of the real variants? If you aren’t careful enough in your consideration of positions, you may inadvertently throw out true variants. So the best way to prove to ourselves that the variant caller is working is to give it some variants to identify and see how many it can find without finding any variants that you didn’t add. To do this, we used the maq fakemut utility to introduce some fake SNPs and INDELs into the E. coli DH10B genome. We then took a local resequencing run of E. coli from a 316 chip with about 11x AQ17 coverage. This analysis was done with Torrent Suite 1.4.1, so if it were repeated with the new Torrent Suite software we may see some improved results. Goal: identify the most sensitive settings that don’t lose PPV. 1) True positive variants: inserted variants that were identified in the experiment. 2) False positive variants: variants identified in the experiment that were not inserted into the genome. 3) False negative variants: inserted variants that were ‘missed’ by the variant caller. With these numbers it’s possible to calculate sensitivity and PPV for the run.
The variant calling plugin does not use the samtools default settings. The defaults are tuned to a different error model, where SNPs are the main source of sequencing error. In Ion Torrent data, where the main mode of sequencing error is INDELs, we need to slightly tweak the variant calling settings. Ion Torrent does this by lowering the base quality for a position that is a potential candidate to be a variant. It also requires coverage from both strands to call an INDEL. Homopolymer INDELs are more likely to come from sequencing error than to be actual INDELs in Ion Torrent data. Homopolymer INDELs are dealt with by requiring at least two reads covering a homopolymer INDEL from both directions in order for it to be a candidate INDEL.
Some of the runs are greyed out. The process was iterative: we tried many different samtools settings to try to identify the ‘best’ settings. The two major data points are the default samtools settings and the variant calling settings. Default samtools identifies all of the well-covered SNPs and INDELs, at the cost of identifying many false positives. The variant calling settings remove the majority of these false positive calls, at the cost of missing some of the INDELs.
LOOK AT THE SNPs. So good all the time. Where it struggles is with INDELs. There is a trade-off between a complete view of true positives and having to weed through many false positives.
Public datasets show a similar distribution for variant calling data. With newer technologies and higher accuracy, PPV and sensitivity increase. All runs are able to identify SNPs with high PPV and sensitivity. The real challenge for Ion Torrent data is homopolymer errors.
Similar to the performance of 1 base pair INDELs. The high accuracy dataset is able to identify the majority of the homopolymer INDELs without many false positives.
Variant calling is application specific: if you can tolerate losing a small portion of true SNPs in your analysis, it may be worth not searching through a lot of false positives to get to the actual variants. It’s easy to re-run the Germ Line Variant Caller with your own settings right from the run report. More information about our variant calling analysis can be found on our blog, edgebio.com/blog.
Why is this important? With higher throughput, it is possible to run many samples per chip. Total cost per sample will decrease. We need robust barcodes that effectively separate sequences. Does this affect quality?
Tested this with HuRef gDNA. We could use DH10B, but we prefer a more realistic application. Used a subset of the IonSet barcodes. 11bp barcodes.
About 2,000,000 reads with 11bp in their highest quality regions. We are losing about 22 Mbp (2,000,000 reads × 11bp) of the highest quality data to the barcode. This explains the lower quality. Does this affect mapping?
Slightly decreased mapping of barcoded reads. There is a slightly reduced number of perfect reads in the barcoded sample, but the vast majority of the reads still map, and almost 50% of the mapping reads map perfectly to the genome.
Why is this important? There are several applications: bacterial de novo assembly, structural variation, and haplotype phasing.