Ion Torrent Sequencing Applications:
   Variant Calling, Barcoding, and Long
            Range Mate Pairs

                    David Jenkins
               Bioinformatics Engineer
                       EdgeBio
Contract Research Division
• Five SOLiD4 sequencing platforms
• One Life Technologies 5500XL
• Two Ion Torrent PGMs
• Automation thru Caliper Sciclone & Biomek FX
• Life Technologies Preferred Service Provider
• Agilent Certified Service Provider
• Commercial partnerships with companies such as CLCBio,
  DNANexus and Genologics
• MD/PhD & Masters Level Scientists and Bioinformaticians
• IT Infrastructure of >100 CPUs and >100TB storage
Agenda
• Germ Line Variant Caller
• Barcoding
• Long Range Mate Pair Data
Variant Calling
Variant Calling
• Goal: identify SNPs
  and INDELs
   – High sensitivity
      • Few false negatives
   – High positive predictive
     value
      • Few false positives
• Challenge: distinguish
  between homopolymer
  sequencing error and
  true INDELs
Variant Calling
•   E. coli DH10B: well-validated reference sequence
•   Any variant called against the unmodified reference is a false positive
•   Measure PPV and sensitivity
•   maq fakemut used to insert artificial mutations
    – 220 SNPs and 239 INDELs
• EdgeBio 316 Chip Run
    – 11.00x AQ17 coverage of genome
• Goal: identify most sensitive (true pos./[true pos. +
  false neg.]) settings that don’t lose PPV (true pos./[true
  pos. + false pos.])
    – Identify the most variants while avoiding calling any non-
      variants
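The two metrics defined on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis; the function names and the example counts are assumptions made for the sketch (only the total of 220 inserted SNPs comes from the slide).

```python
import math


def sensitivity(true_pos: int, false_neg: int) -> float:
    """true pos. / [true pos. + false neg.]: share of inserted variants that were found."""
    total = true_pos + false_neg
    return true_pos / total if total else math.nan


def ppv(true_pos: int, false_pos: int) -> float:
    """true pos. / [true pos. + false pos.]: share of called variants that are real."""
    total = true_pos + false_pos
    return true_pos / total if total else math.nan


if __name__ == "__main__":
    # Hypothetical counts for illustration only: 220 SNPs were inserted (from the
    # slide); the split into found/missed/false calls below is made up.
    tp, fn, fp = 215, 5, 3
    print(f"sensitivity = {sensitivity(tp, fn):.3%}")
    print(f"PPV         = {ppv(tp, fp):.3%}")
```

Both functions return NaN when the denominator is zero (for example, PPV when nothing was called), which keeps an empty run from raising an error.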
Samtools Defaults vs. Variant Calling
              Settings

• Default samtools setting not optimized for Ion
  Torrent error model
  – Lower base quality of candidates
  – Coverage from both strands
  – Strict requirements for homopolymers
     • two sequences from both strands
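The strand requirements above can be pictured as a simple candidate filter. The sketch below is one possible reading of the slide, not the plugin's actual logic: the `Candidate` record, its field names, and the interpretation of "two sequences from both strands" as at least two supporting reads on each strand are all assumptions, and SNPs are passed through unchecked because the slide does not spell out their rule.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate variant (fields invented for this sketch)."""
    kind: str             # "SNP" or "INDEL"
    in_homopolymer: bool  # True if the candidate falls in a homopolymer run
    fwd_reads: int        # supporting reads on the forward strand
    rev_reads: int        # supporting reads on the reverse strand


def passes_strand_filter(c: Candidate, min_per_strand_hp: int = 2) -> bool:
    """Keep an INDEL only if both strands support it; be stricter in homopolymers."""
    if c.kind != "INDEL":
        return True  # assumption: SNPs are not strand-filtered in this sketch
    if c.fwd_reads < 1 or c.rev_reads < 1:
        return False  # coverage from both strands is required
    if c.in_homopolymer:
        # assumed reading of "two sequences from both strands"
        return c.fwd_reads >= min_per_strand_hp and c.rev_reads >= min_per_strand_hp
    return True


if __name__ == "__main__":
    print(passes_strand_filter(Candidate("INDEL", True, 3, 1)))   # homopolymer, weak reverse support
    print(passes_strand_filter(Candidate("INDEL", False, 3, 1)))  # non-homopolymer, both strands seen
```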
                                          PPV                        Corrected Sensitivity
Settings                        Total     SNPs      INDELs      Total     SNPs      INDELs
Samtools Default Settings       6.014%    96.682%   3.203%      100%      100%      100%
Q4, h100, o20, e27, m1, H1      39.672%   100%      25.060%     98.690%   99.550%   97.910%
Q14, h100, o20, e21, m1, H2     79.565%   100%      64.259%     92.810%   98.180%   89.870%
Q7, h50, o10, e17, m4, H1       93.523%   100%      86.486%     91.720%   99.090%   84.940%
Q14, h50, o10, e17, m4, H1      95.148%   100%      89.655%     90.850%   98.180%   84.100%
Variant Calling Settings        95.676%   100%      90.533%     90.650%   99.550%   83.260%
Q14, h50, o10, e17, m4, H2      97.175%   100%      93.631%     89.540%   96.360%   83.260%
PPV and Sensitivity of Samtools Analyses
[Chart: Total, SNPs, and INDELs PPV and Corrected Sensitivity (0% to 100%) for each samtools setting tested, from Default Samtools through Variant Calling to Q14, h50, o10, e17, m4, H2]
Similar Results with Public DH10B Runs
[Chart: PPV and Sensitivity of Public DH10B Runs. Series: Total, SNP, and INDEL PPV and Sensitivity (0% to 100%). Runs: Life Ion Torrent 314 100MB; Life Ion Torrent 316LR DH10B; Life Ion Torrent 318 Chip; Life Ion Torrent 314LR DH10B; Edge Bio Ion Torrent 316 DH10B; Life Ion Torrent 316 DH10B; Life Ion Torrent 316LR DH10B >99% accuracy]
Homopolymer Mutated Reference Genome
[Chart: Homopolymer PPV and Sensitivity (0% to 100%). Runs: Life Ion Torrent 314 100MB; Life Ion Torrent 316LR DH10B; Life Ion Torrent 318 Chip; Life Ion Torrent 314LR DH10B; Edge Bio Ion Torrent 316 DH10B; Life Ion Torrent 316 DH10B; Life Ion Torrent 316LR DH10B >99% per base accuracy with long reads]
Conclusions
• Variant Calling plugin able to identify >80%
  well-covered INDELs and >99% well-covered SNPs
• Improves on performance of default samtools settings
  by avoiding false positive SNPs and INDELs
• Important to remember Variant Calling is
  Application Specific
• Easy to re-run Germ Line Variant Caller with
  custom settings.
• More information at http://www.edgebio.com/blog/
Barcoding
Barcoding
• HuRef gDNA
• Compared read quality statistics with non-
  barcoded run
• IonSet barcodes 5-8
• 11bp barcodes at beginning of the read
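Because the barcode occupies the first 11 bp of each read, assigning a read to a sample amounts to a prefix lookup followed by trimming. The sketch below illustrates that idea only; the barcode sequences and sample names are hypothetical placeholders (they are not the real IonSet barcodes 5-8), it requires an exact match, and it is not the Torrent Suite implementation.

```python
from typing import Optional, Tuple

BARCODE_LEN = 11  # from the slide: 11bp barcodes at the beginning of the read

# Hypothetical barcode sequences for illustration; NOT the real IonSet barcodes.
HYPOTHETICAL_BARCODES = {
    "ACGTACGTACG": "sample_1",
    "TTGGCCAATTC": "sample_2",
    "GATTACAGATC": "sample_3",
}


def assign_barcode(read: str) -> Tuple[Optional[str], str]:
    """Return (sample, trimmed read); sample is None if the prefix matches no barcode."""
    sample = HYPOTHETICAL_BARCODES.get(read[:BARCODE_LEN])
    if sample is None:
        return None, read  # leave unassigned reads untouched
    return sample, read[BARCODE_LEN:]


def bases_lost_to_barcodes(n_reads: int) -> int:
    """Bases spent on barcodes across a run: one barcode per read."""
    return n_reads * BARCODE_LEN


if __name__ == "__main__":
    print(assign_barcode("ACGTACGTACG" + "TTTTGGGGCCCC"))
    print(bases_lost_to_barcodes(2_000_000))
```

For a run of about 2,000,000 reads, the figure given in the speaker notes, the barcodes account for 2,000,000 × 11 = 22,000,000 bases taken from the start of the reads.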
Barcoding
• 94.51% reads mapped
  to barcodes used.
• Variant Calling Report
  for Each Barcode
   – New feature in 1.5.1
• Ion Community Feature
  Requests
   – Aligning barcodes to
     different references
   – Find out what
     community wants
Quality Comparison
Barcoded hg19 Run (TS 1.5.1)   Non-barcoded hg19 Run (TS 1.5.1)
Mapping Comparison
Conclusions
• Similar quality between barcoded and non
  barcoded runs
• Robust set of barcodes
• Losing first 11 high quality bases to the
  barcode
   – Explains lower initial quality
• 318 Chip and Barcoding
• Ion Torrent Community
   – Technical details
   – Desired Features
   – Troubleshooting
Long Range Mate Pairs
Long Range Mate Pairs
• Data provided by Ion Torrent
• Average 10KB inserts
• Split sff files with sff_extract utility
      •   >IA_A
      •   CTGCTGTACGGCCAAGGCGGATGTACGGTACAGCAG
      •   >IA_B
      •   CTGCTGTACCGTACATCCGCCTTGGCCGTACAGCAG
• Can reads map successfully with average 10KB
  inserts?
   – Increasing homopolymers farther into read
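Splitting a mate-pair read means locating the internal linker and keeping the sequence on either side of it as the two mates. The sketch below shows that idea with exact string matching on the two linker sequences from the slide (IA_B is the reverse complement of IA_A). It is an illustration only and not how sff_extract works: the category labels, the `min_len` cut-off of 20 bp, and the exact-match rule are assumptions, so partial linkers are not detected here.

```python
from typing import List, Tuple

LINKER_A = "CTGCTGTACGGCCAAGGCGGATGTACGGTACAGCAG"  # IA_A from the slide
LINKER_B = "CTGCTGTACCGTACATCCGCCTTGGCCGTACAGCAG"  # IA_B from the slide
LINKERS = (LINKER_A, LINKER_B)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_read(read: str, min_len: int = 20) -> Tuple[str, List[str]]:
    """Classify a read and return (label, sequences kept).

    min_len is a hypothetical cut-off chosen for this sketch.
    """
    n_hits = sum(read.count(linker) for linker in LINKERS)
    if n_hits == 0:
        return "orphan", [read]  # no linker found: a single sequence
    if n_hits > 1:
        return "multiple_linker", []
    linker = next(l for l in LINKERS if l in read)
    pos = read.find(linker)
    left, right = read[:pos], read[pos + len(linker):]
    if len(left) < min_len or len(right) < min_len:
        return "too_short", []
    return "correctly_split", [left, right]


if __name__ == "__main__":
    example = "A" * 30 + LINKER_A + "C" * 30  # synthetic read for illustration
    print(split_read(example))
```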
Unsplit Reads
Metric                         Value
Total Number of Bases [Mbp]    404.65
Q17 Bases [Mbp]                207.67
Q20 Bases [Mbp]                150.07
Total Number of Reads          2,308,396
Mean Length [bp]               175
Longest Read [bp]              365




                                    From: http://flxlexblog.wordpress.com
Split Reads Metrics
Type               Count        Percent
Total Reads        2,308,396
Orphan Reads       220,707      9.561%
Partial Linker     106,913      4.631%
Multiple Linker    29           0.001%
Too Short          1,757        0.076%
Correctly Split    1,978,990    85.730%

[Bar chart: read counts (0 to 2,000,000) by category: Orphan Reads (1 seq), Partial Linker Found, Multiple Linker Occurrences, Too Short, Correctly Split Reads]
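The percentages on this slide follow directly from the category counts, and the five categories add up to the total number of reads. A short sketch of that arithmetic, using only the counts shown on the slide (the function name is an assumption for the example):

```python
from typing import Dict

# Counts taken from the Split Reads Metrics slide.
SPLIT_COUNTS = {
    "Orphan Reads": 220_707,
    "Partial Linker": 106_913,
    "Multiple Linker": 29,
    "Too Short": 1_757,
    "Correctly Split": 1_978_990,
}
TOTAL_READS = 2_308_396


def percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Express each category as a percentage of all reads in the table."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    # The five categories partition the run, so they must sum to the total.
    assert sum(SPLIT_COUNTS.values()) == TOTAL_READS
    for name, pct in percentages(SPLIT_COUNTS).items():
        print(f"{name:16s} {SPLIT_COUNTS[name]:>10,d} {pct:8.3f}%")
```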
Reads 1
• Per base sequence quality below Q20 after base 20
• Analysis performed pre TS 1.5 release
   • Predicted base quality has improved
• Homopolymer enrichment relatively consistent across the read
Reads 2
• Per base sequence quality below Q20
• Second part of read in lower quality region of unsplit read
• Homopolymer enrichment still fairly uniform
Insert Size
bwa:   μ = 10189.78, σ = 1282.43
tmap:  μ = 9751.20, σ = 2016.62
Mapping
Metric                           AQ17      AQ20      Perfect
Total Number of Bases [Mbp]      218.55    179.37    170.28
Mean Length [bp]                 70        63        60
Longest Alignment [bp]           173       171       167
Mean Coverage Depth              46.6x     38.3x     36.3x
Percentage of Library Covered    99.99%    99.99%    99.99%

Read Length [bp]   Reads       Unmapped   Excluded   Clipped   Perfect     1 mismatch   >= 2 mismatches
50                 3,240,310   15,981     20         0         1,810,959   744,229      669,121
100                349,925     1,340      5          49,717    104,928     72,110       121,825
150                1,944       73         0          851       127         172          721
Conclusion
• Long reads capable of producing Mate Pair
  reads
   – Quality mapping
   – Tight distribution around insert size
• Human Application
   – With longer insert sizes (40kb) could be used to
     resolve structural variation
• Blog post coming soon:
   – http://www.edgebio.com/blog/
Thanks
Edge Bio Team
• Lab
   – Joy Adigun
   – Jennifer Sheffield
   – Ryan Mease
   – Rossio Kersey
• Informatics
   – Anju Varadarajan
   – Phil Dagosto
• Justin Johnson
• John Seed
• Dean Galaas
Follow Us:
• EdgeBio Twitter: @EdgeBio
• David Jenkins Twitter: @dfjenkins3
• Justin Johnson Twitter: @BioInfo
• djenkins@edgebio.com
• http://www.edgebio.com/blog/


Editor's Notes

  1. Introduction: I work on Ion Torrent data and try to stay on top of all newly released public Ion Torrent datasets. I write blog posts for the EdgeBio website and we’ve recently posted about the Germ Line variant caller in the Ion Torrent pipeline.
  2. With new high throughput sequencing techniques, lower costs, and faster turnaround time, we see broader applications for high throughput sequencing data. Aim to provide sequencing as a service to allow anyone access to quality data and analysis in a broad range of applications. The foundation of our service is Ion Torrent. Going to talk about three applications of Ion Torrent sequencing data.
  3. Talk a bit about three applications that we have been analyzing at Edge and why we think they’re important.
  4. Why is this important? Identification of variants is one of the foundations of resequencing projects. From amplicons to whole genome sequencing, we want to use Ion Torrent to quickly and accurately identify variants in our data sets. Challenging problem for each sequencing platform for different reasons. Looking for the ‘best’ solution for variant calling.
  5. Relatively recent and welcome addition to the Ion Torrent pipeline. Goal: identify SNPs and INDELs with high sensitivity (find all of the SNPs that are actually there) and high positive predictive value (avoid calling any false positive SNPs). Picture of a SNP seen with samtools tview. Line at the top is the reference sequence and all of the dots and commas are agreement with the reference. At one position the reference sequence is a T but the base mapping to that position is a ‘C’. Ion Torrent does struggle with homopolymer sequencing error, and the challenge in the variant calling plugin is to distinguish between homopolymer sequencing errors and true indels in the sample. How can we test a variant caller’s ability to identify true variants and avoid false variant calls?
  6. Use DH10B. Well validated sequence, so any variants that your program identifies during variant calling indicate false positive variants. Fine-tune your settings until you are able to minimize the number of false positive variants identified. What about identifying all of the real variants? If you aren’t careful enough in your consideration of positions, you may inadvertently throw out true variants. So the best way to prove to ourselves that the variant caller is working is to give the variant caller some variants to identify and see how many it can find without finding any variants that you didn’t add. To do this, we used the maq fakemut utility to introduce some fake SNPs and INDELs into the E. coli DH10B genome. We then took a local resequencing run of E. coli from a 316 chip with about 11 times AQ17 coverage. This analysis was done with Torrent Suite 1.4.1, so if it were repeated with the new Torrent Suite software we may see some improved results. Goal: identify most sensitive settings that don’t lose PPV. 1) True positive variants – inserted variants that were identified in the experiment. 2) False positive variants – variants identified in the experiment that were not inserted into the genome. 3) False negative variants – inserted variants that were ‘missed’ by the variant caller. With these numbers it’s possible to calculate sensitivity and PPV for the run.
  7. The variant calling plugin does not use samtools default settings. The defaults are tuned to a different error model where SNPs are the main source of sequencing error. In Ion Torrent data, where the main mode of sequencing error is INDELs, we need to slightly tweak the variant calling settings. Ion Torrent does this by lowering base quality for a position that is a potential candidate to be a variant. It also requires coverage from both strands to call an indel. In Ion Torrent data, homopolymer INDELs are more likely to come from sequencing error than to be real INDELs. Homopolymer indels are dealt with by requiring at least two reads covering a homopolymer indel from both directions in order for it to be a candidate indel.
  8. Greyed out some of the runs. Process was iterative: we tried many different samtools settings to try to identify the ‘best’ settings. Two major data points are the default samtools settings and the variant calling settings. Default samtools identifies all of the well-covered SNPs and indels, at the cost of identifying many false positives. Variant calling settings remove the majority of these false positive calls, at the cost of missing some of the true indels.
  9. LOOK AT THE SNPs. So good all the time. Where it struggles is with indels. Trade-off between a complete view of true positives and having to weed through many false positives.
  10. Public datasets show a similar distribution for variant calling data. With newer technologies and higher accuracy, PPV and sensitivity increase. All runs are able to identify SNPs with high PPV and sensitivity. Real challenge for Ion Torrent data: homopolymer errors.
  11. Similar to the performance of 1 base pair indels. High accuracy dataset able to identify the majority of the homopolymer indels without many false positives.
  12. Application specific: if you can tolerate losing a small portion of true SNPs in your analysis, it may be worth not searching through a lot of false positives to get to the actual variants. It’s easy to re-run the Germ Line variant caller with your own settings right from the run report. More information about our variant calling analysis can be found on our blog, edgebio.com/blog.
  13. Why is this important? With higher throughput, it is possible to run many samples per chip. Total cost per sample will decrease. Need robust barcodes that effectively separate sequences. Does this affect quality?
  14. Tested this with HuRef gDNA. Could use DH10B but we prefer a more real application. Used a subset of the IonSet barcodes. 11bp barcodes.
  15. About 2,000,000 reads with 11bp in their highest quality regions. Losing about 22 Mbp of your highest quality data to the barcode. Explains the lower quality. Does this affect mapping?
  16. Slightly decreased mapping of barcoded reads. Slightly reduced number of perfect reads in barcoded sample, but still a vast majority of the reads are mapping and almost 50% of the mapping reads are mapping perfectly to the genome.
  17. Why is this important? Several applications: bacterial de novo assembly, structural variation, haplotype phasing.