Ion Torrent Sequencing Applications: Variant Calling, Barcoding, and Long Range Mate Pairs



EdgeBio discusses three applications for Ion Torrent sequencing that we have been exploring lately. We discuss the robustness of the included Germ Line Variant Caller, the barcoding capability on the Ion Torrent, and a new dataset of Long Range (10kb) Mate Pairs.

Published in: Technology, Business

  • Introduction: I work on Ion Torrent data and try to stay on top of all newly released public Ion Torrent datasets. I write blog posts for the EdgeBio website, and we’ve recently posted about the Germ Line Variant Caller in the Ion Torrent pipeline.
  • With new high-throughput sequencing techniques, lower costs, and faster turnaround times, we see broader applications for high-throughput sequencing data. We aim to provide sequencing as a service so that anyone can access quality data and analysis in a broad range of applications. The foundation of our service is Ion Torrent. I am going to talk about three applications of Ion Torrent sequencing data.
  • I will talk a bit about three applications that we have been analyzing at Edge and why we think they’re important.
  • Why is this important? Identification of variants is one of the foundations of resequencing projects, from amplicons to whole-genome sequencing. We want to use Ion Torrent to quickly and accurately identify variants in our data sets. This is a challenging problem for each sequencing platform, for different reasons. We are looking for the ‘best’ solution for variant calling.
  • The Germ Line Variant Caller is a relatively recent and welcome addition to the Ion Torrent pipeline. Goal: identify SNPs and INDELs with high sensitivity (find all of the SNPs that are actually there) and high positive predictive value (avoid calling any false positive SNPs). The picture shows a SNP seen with samtools tview. The line at the top is the reference sequence, and all of the dots and commas indicate agreement with the reference. At one position the reference base is a T, but the base mapping to that position is a C. Ion Torrent does struggle with homopolymer sequencing error, and the challenge for the variant calling plugin is to distinguish between homopolymer sequencing errors and true INDELs in the sample. How can we test a variant caller’s ability to identify true variants and avoid false variant calls?
  • Use DH10B. It is a well-validated sequence, so any variants that your program identifies during variant calling are false positives. Fine-tune your settings until you minimize the number of false positive variants identified. What about identifying all of the real variants? If you aren’t careful enough in your consideration of positions, you may inadvertently throw out true variants. So the best way to prove to ourselves that the variant caller is working is to give it some variants to identify and see how many it can find without finding any variants that we didn’t add. To do this, we used the maq fakemut utility to introduce some fake SNPs and INDELs into the E. coli DH10B genome. We then took a local resequencing run of E. coli from a 316 chip with about 11x AQ17 coverage. This analysis was done with Torrent Suite 1.4.1, so if it were repeated with the newer Torrent Suite software we might see some improved results. Goal: identify the most sensitive settings that don’t lose PPV. We count three kinds of variants: 1) true positives: inserted variants that were identified in the experiment; 2) false positives: variants identified in the experiment that were not inserted into the genome; 3) false negatives: inserted variants that were ‘missed’ by the variant caller. With these numbers it’s possible to calculate sensitivity and PPV for the run.
  • The variant calling plugin does not use the samtools default settings. Those defaults are tuned to a different error model, where miscalled bases (apparent SNPs) are the main source of sequencing error. In Ion Torrent data, where the main mode of sequencing error is INDELs, we need to tweak the variant calling settings slightly. Ion Torrent does this by lowering the base quality for a position that is a potential variant candidate. It also requires coverage from both strands to call an INDEL. In Ion Torrent data, homopolymer INDELs are more likely to come from sequencing error than to be true INDELs. Homopolymer INDELs are dealt with by requiring at least two reads covering the homopolymer INDEL from both directions before it can be a candidate INDEL.
  • Some of the runs are greyed out. The process was iterative: we tried many different samtools settings to try to identify the ‘best’ settings. The two major data points are the default samtools settings and the variant calling settings. Default samtools identifies all of the well-covered SNPs and INDELs, at the cost of identifying many false positives. The variant calling settings remove the majority of these false positive calls, at the cost of missing some of the INDELs.
  • Look at the SNPs: they are called well across all of the settings. Where it struggles is with INDELs. There is a trade-off between a complete view of the true positives and having to weed through many false positives.
  • Public datasets show a similar distribution for variant calling data. With newer technologies and higher accuracy, PPV and sensitivity increase. All runs are able to identify SNPs with high PPV and sensitivity. The real challenge for Ion Torrent data is homopolymer errors.
  • Similar to the performance for 1 base pair INDELs. The high-accuracy dataset is able to identify the majority of the homopolymer INDELs without many false positives.
  • It is application specific: if you can tolerate losing a small portion of true SNPs in your analysis, it may be worth it to avoid searching through a lot of false positives to get to the actual variants. It’s easy to re-run the Germ Line Variant Caller with your own settings right from the run report. More information about our variant calling analysis can be found on our blog.
  • Why is this important? With higher throughput, it is possible to run many samples per chip. Total cost per sample will decrease. We need robust barcodes that effectively separate sequences. Does this affect quality?
  • We tested this with HuRef gDNA. We could use DH10B, but we prefer a more realistic application. We used a subset of the IonSet barcodes. These are 11 bp barcodes.
  • About 2,000,000 reads carry the 11 bp barcode in their highest quality region. That means losing about 22 Mbp (2,000,000 reads × 11 bp) of your highest quality data to the barcode. This explains the lower quality. Does this affect mapping?
  • Mapping of barcoded reads is slightly decreased. There is a slightly reduced number of perfect reads in the barcoded sample, but the vast majority of the reads still map, and almost 50% of the mapping reads map perfectly to the genome.
  • Why is this important? There are several applications: bacterial de novo assembly, structural variation, and haplotype phasing.
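The sensitivity and PPV bookkeeping described in the variant calling notes above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used for the talk: the function name `score_calls`, the `Variant` tuple layout, and the toy variant sets are all assumptions made for the example. It uses the plain definitions from the slides (sensitivity = true pos./[true pos. + false neg.], PPV = true pos./[true pos. + false pos.]) and does not attempt the coverage-based "corrected" sensitivity.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical variant key: (position, reference allele, alternate allele).
Variant = Tuple[int, str, str]


class CallStats(NamedTuple):
    true_pos: int
    false_pos: int
    false_neg: int
    sensitivity: float
    ppv: float


def score_calls(inserted: Set[Variant], called: Set[Variant]) -> CallStats:
    """Compare called variants against the variants inserted into the reference."""
    tp = len(inserted & called)   # inserted variants that were found
    fp = len(called - inserted)   # calls that were never inserted
    fn = len(inserted - called)   # inserted variants that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    return CallStats(tp, fp, fn, sensitivity, ppv)


# Toy data, invented for illustration only.
inserted = {(100, "T", "C"), (250, "A", "G"), (400, "G", "GA"), (900, "CT", "C")}
called = {(100, "T", "C"), (250, "A", "G"), (400, "G", "GA"), (777, "A", "AT")}

stats = score_calls(inserted, called)
print(f"sensitivity={stats.sensitivity:.2%}  PPV={stats.ppv:.2%}")
```

Keying variants on position and alleles keeps the comparison a simple set operation; a real comparison would probably also need to normalize how INDELs are represented before matching.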

    1. Ion Torrent Sequencing Applications: Variant Calling, Barcoding, and Long Range Mate Pairs
       David Jenkins, Bioinformatics Engineer, EdgeBio
    2. Contract Research Division
       • Five SOLiD4 sequencing platforms
       • One Life Technologies 5500XL
       • Two Ion Torrent PGMs
       • Automation through Caliper Sciclone & Biomek FX
       • Life Technologies Preferred Service Provider
       • Agilent Certified Service Provider
       • Commercial partnerships with companies such as CLCBio, DNANexus and Genologics
       • MD/PhD & Masters Level Scientists and Bioinformaticians
       • IT Infrastructure of >100 CPUs and >100TB storage
    3. Agenda
       • Germ Line Variant Caller
       • Barcoding
       • Long Range Mate Pair Data
    4. Variant Calling
    5. Variant Calling
       • Goal: identify SNPs and INDELs
         – High sensitivity
           • Few false negatives
         – High positive predictive value
           • Few false positives
       • Challenge: distinguish between homopolymer sequencing error and true INDELs
    6. Variant Calling
       • DH10B
       • All identified variants are false positives
       • PPV and sensitivity
       • maq fakemut used to insert artificial mutations
         – 220 SNPs and 239 INDELs
       • EdgeBio 316 Chip Run
         – 11.00x AQ17 coverage of genome
       • Goal: identify most sensitive (true pos./[true pos. + false neg.]) settings that don’t lose PPV (true pos./[true pos. + false pos.])
         – Identify the most variants while avoiding calling any non-variants
    7. Samtools Defaults vs. Variant Calling Settings
       • Default samtools settings not optimized for Ion Torrent error model
         – Lower base quality of candidates
         – Coverage from both strands
         – Strict requirements for homopolymers
           • Two sequences from both strands
    8. PPV and Corrected Sensitivity by Settings
       Settings                    | PPV Total | PPV SNPs | PPV INDELs | Corr. Sens. Total | Corr. Sens. SNPs | Corr. Sens. INDELs
       Samtools Default Settings   | 6.014%    | 96.682%  | 3.203%     | 100%              | 100%             | 100%
       Q4, h100, o20, e27, m1, H1  | 39.672%   | 100%     | 25.060%    | 98.690%           | 99.550%          | 97.910%
       Q14, h100, o20, e21, m1, H2 | 79.565%   | 100%     | 64.259%    | 92.810%           | 98.180%          | 89.870%
       Q7, h50, o10, e17, m4, H1   | 93.523%   | 100%     | 86.486%    | 91.720%           | 99.090%          | 84.940%
       Q14, h50, o10, e17, m4, H1  | 95.148%   | 100%     | 89.655%    | 90.850%           | 98.180%          | 84.100%
       Variant Calling Settings    | 95.676%   | 100%     | 90.533%    | 90.650%           | 99.550%          | 83.260%
       Q14, h50, o10, e17, m4, H2  | 97.175%   | 100%     | 93.631%    | 89.540%           | 96.360%          | 83.260%
    9. PPV and Sensitivity of Samtools Analyses
       [Chart, 0% to 100%. Series: Total PPV, SNPs PPV, INDELs PPV, Total Corrected Sensitivity, SNPs Corrected Sensitivity, INDELs Corrected Sensitivity. X-axis: one group per settings combination, from Default Samtools through Variant Calling to Q14, h50, o10, e17, m4, H2.]
    10. Similar Results with Public DH10B Runs
       [Chart: PPV and Sensitivity of Public DH10B Runs, 0% to 100%. Series: Total PPV, SNP PPV, INDEL PPV, Total Sensitivity, SNP Sensitivity, INDEL Sensitivity. X-axis: Life Ion Torrent public DH10B runs (314, 314LR, 316, 316LR, and 318 chips, including a 316LR >99% accuracy run) and the Edge Bio Ion Torrent 316 DH10B run.]
    11. Homopolymer Mutated Reference Genome
       [Chart: Homopolymer PPV and Sensitivity, 0% to 100%. Series: Homopolymer PPV, Homopolymer Sensitivity. X-axis: the same Life Ion Torrent public DH10B runs and the Edge Bio Ion Torrent 316 DH10B run.]
       >99% per base accuracy with long reads.
    12. Conclusions
       • Variant Calling plugin able to identify >80% well-covered INDELs and >99% well-covered SNPs
       • Improves on performance of default samtools settings by avoiding false positive SNPs and INDELs
       • Important to remember Variant Calling is Application Specific
       • Easy to re-run Germ Line Variant Caller with custom settings.
       • More information at
    13. Barcoding
    14. Barcoding
       • HuRef gDNA
       • Compared read quality statistics with non-barcoded run
       • IonSet barcodes 5-8
       • 11bp barcodes at beginning of the read
    15. Barcoding
       • 94.51% reads mapped to barcodes used.
       • Variant Calling Report for Each Barcode
         – New feature in 1.5.1
       • Ion Community Feature Requests
         – Aligning barcodes to different references
         – Find out what community wants
    16. Quality Comparison
       [Figure panels: Barcoded hg19 Run (TS 1.5.1); Non-barcoded hg19 Run (TS 1.5.1)]
    17. Mapping Comparison
    18. Conclusions
       • Similar quality between barcoded and non-barcoded runs
       • Robust set of barcodes
       • Losing first 11 high quality bases to the barcode
         – Explains lower initial quality
       • 318 Chip and Barcoding
       • Ion Torrent Community
         – Technical details
         – Desired Features
         – Troubleshooting
    19. Long Range Mate Pairs
    20. Long Range Mate Pairs
       • Data provided by Ion Torrent
       • Average 10 kb inserts
       • Split sff files with sff_extract utility, using the linker sequences:
           >IA_A
           CTGCTGTACGGCCAAGGCGGATGTACGGTACAGCAG
           >IA_B
           CTGCTGTACCGTACATCCGCCTTGGCCGTACAGCAG
       • Can reads map successfully with average 10 kb inserts?
         – Increasing homopolymers farther into read
    21. Unsplit Reads
       Metric                      | Value
       Total Number of Bases [Mbp] | 404.65
       Q17 Bases [Mbp]             | 207.67
       Q20 Bases [Mbp]             | 150.07
       Total Number of Reads       | 2,308,396
       Mean Length [bp]            | 175
       Longest Read [bp]           | 365
       From:
    22. Split Reads Metrics
       Type            | Count     | Percent
       Total Reads     | 2,308,396 |
       Orphan Reads    | 220,707   | 9.561%
       Partial Linker  | 106,913   | 4.631%
       Multiple Linker | 29        | 0.001%
       Too Short       | 1,757     | 0.076%
       Correctly Split | 1,978,990 | 85.730%
       [Bar chart of read counts (0 to 2,000,000) by category: Orphan Reads (1 seq), Partial Linker Found, Multiple Linker Occurrences, Too Short, Correctly Split Reads]
    23. Reads 1
       • Per base sequence quality below Q20 after base 20
       • Analysis performed pre TS 1.5 release
         • Predicted base quality has improved
       • Homopolymer enrichment relatively consistent across the read
    24. Reads 2
       • Per base sequence quality below Q20
       • Second part of read in lower quality region of unsplit read
       • Homopolymer enrichment still fairly uniform
    25. Insert Size
       • bwa: μ = 10189.78, σ = 1282.43
       • tmap: μ = 9751.20, σ = 2016.62
    26. Mapping
       Metric                        | AQ17   | AQ20   | Perfect
       Total Number of Bases [Mbp]   | 218.55 | 179.37 | 170.28
       Mean Length [bp]              | 70     | 63     | 60
       Longest Alignment [bp]        | 173    | 171    | 167
       Mean Coverage Depth           | 46.6x  | 38.3x  | 36.3x
       Percentage of Library Covered | 99.99% | 99.99% | 99.99%

       Read Length [bp] | Reads     | Unmapped | Excluded | Clipped | Perfect   | 1 mismatch | >= 2 mismatches
       50               | 3,240,310 | 15,981   | 20       | 0       | 1,810,959 | 744,229    | 669,121
       100              | 349,925   | 1,340    | 5        | 49,717  | 104,928   | 72,110     | 121,825
       150              | 1,944     | 73       | 0        | 851     | 127       | 172        | 721
    27. Conclusion
       • Long reads capable of producing Mate Pair reads
         – Quality mapping
         – Tight distribution around insert size
       • Human Application
         – With longer insert sizes (40kb) could be used to resolve structural variation
       • Blog post coming soon:
    28. Thanks
       Edge Bio Team
       • Lab
         – Joy Adigun
         – Jennifer Sheffield
         – Ryan Mease
         – Rossio Kersey
       • Informatics
         – Anju Varadarajan
         – Phil Dagosto
       • Justin Johnson
       • John Seed
       • Dean Galaas
       Follow Us:
       • EdgeBio Twitter: @EdgeBio
       • David Jenkins Twitter: @dfjenkins3
       • Justin Johnson Twitter: @BioInfo
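As a rough illustration of the read-splitting step from the Long Range Mate Pairs slides, the sketch below looks for the IA_A and IA_B linker sequences inside a read and sorts the read into a few of the categories reported in the split-read metrics. It is not how sff_extract works internally: it matches the linker exactly only, so it has no "partial linker" category, and the function names `find_linkers` and `split_read` and the `min_len` threshold are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Linker sequences as given on the Long Range Mate Pairs slide.
LINKERS = {
    "IA_A": "CTGCTGTACGGCCAAGGCGGATGTACGGTACAGCAG",
    "IA_B": "CTGCTGTACCGTACATCCGCCTTGGCCGTACAGCAG",
}


def find_linkers(read: str) -> List[Tuple[int, int]]:
    """Return (start, end) for every exact linker occurrence in the read."""
    hits = []
    for seq in LINKERS.values():
        start = read.find(seq)
        while start != -1:
            hits.append((start, start + len(seq)))
            start = read.find(seq, start + 1)
    return sorted(hits)


def split_read(read: str, min_len: int = 20) -> Tuple[str, Optional[Tuple[str, str]]]:
    """Classify a read and, when possible, split it into its two mate sequences.

    min_len is a hypothetical cutoff for the shortest usable mate.
    """
    hits = find_linkers(read)
    if not hits:
        return "orphan", None            # no linker: a single sequence
    if len(hits) > 1:
        return "multiple_linker", None   # more than one linker occurrence
    start, end = hits[0]
    left, right = read[:start], read[end:]
    if len(left) < min_len or len(right) < min_len:
        return "too_short", None         # a mate is too short to be useful
    return "correctly_split", (left, right)


# Toy reads, invented for illustration only.
left_mate = "ACGT" * 6
right_mate = "TTGCA" * 5
status, mates = split_read(left_mate + LINKERS["IA_A"] + right_mate)
print(status, mates)
```

Handling reads that contain only part of a linker would need approximate matching, which is deliberately left out here to keep the sketch short.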