
Avoiding Nonsense Results in your NGS Variant Studies

Presented at the 2014 Bio-IT World Expo in Boston, this slideshow provides info on the use of Lyons-Weiler's entropy-based measures of genotypic signal to improve concordance among alternative variant calling algorithms and to evaluate various steps in the GATK best practices pipeline. The second part of the talk presented data showing a demarcation bias in the widely used measure of fold change in selected differentially expressed genes, transcripts or proteins from microarray and RNASeq data.

http://www.bio-itworldexpo.com/Next-Gen-Sequencing-Informatics/

Published in: Science

  1. Avoiding Nonsense Results in your NGS Variant Studies. James Lyons-Weiler, PhD, Scientific Director/Senior Research Scientist, Bioinformatics Analysis Core, Genomics & Proteomics Core Laboratories, University of Pittsburgh, Pittsburgh, PA. May 1, 2014
  2. Two Parts
     • Identifying sites with low genotypic signal increases concordance among variant callers
     • Hazards in finding differentially expressed genes in RNASeq, and how to do it more robustly
  3. 23andMe: High risk of RA and psoriasis. GTL: Low risk of RA and psoriasis.
  4. NYTimes Article, etc.
  5. Data were from an Illumina HiSeq 2000
  6. Among-method average concordance: 57.5% overall; 32.7% at high coverage (O’Rawe et al.)
  7. Diagram labels: Information Theory; Consensus Analysis (e.g., 2/3, 3/4, set analysis; -> modeling); Improve Callers (fix errors, modeling); Bake Offs; LOW CONCORDANCE (O’Rawe et al., 2013); VARIANT CALLERS; MAPPER; SEQUENCER; TRUTH (BIOLOGICAL MOLECULAR SEQUENCE); Simulations; Spiked Ins
  8. Entropy of Base Distributions. Three panels of base counts (A, T, C, G): low entropy, high enthalpy; low entropy, high enthalpy; high entropy, low enthalpy
  9. Boltzmann Entropy
     • s = k ln w (Planck)
     • w = antiln(s/k), i.e. w = exp(s/k)
     • http://schneider.ncifcrf.gov/images/boltzmann/boltzmann-tomb-4.html
  10. Rank Sorted Distribution of w (O’Rawe et al. data). Heterozygotes w = 2; Homozygotes w = 1
  11. Example w Density Distribution
  12. w and FBVC
        A     T     C     G   w          pw       Zygosity        Genotype
      200     0     0     0   1          0        Homozygote      AA
       16   158    13    13   2.102558   0        Homozygote      TT
      100   100     0     0   2          0        Heterozygote    AT
       58    30     1   111   2.768507   0        Heterozygote    AG
       28    80    14    78   3.303636   0        Heterozygote    TG
       76    38    29    57   3.758733   0        Heterozygote    AG
       33    49    60    58   3.895496   0.0126   Heterozygote?   CG?
       50    50    50    50   4          1        noise           unknown
  13. Operational* Equiprobable Null Distribution: {f(A) = f(T) = f(G) = f(C)}
  14. Convergence of significance (pw)
  15. What We Expect. Diagram labels: INCREASED CONCORDANCE; Genotypic Signal Filtering; VARIANT/BASE CALLERS; MAPPER; SEQUENCER; TRUTH (BIOLOGICAL MOLECULAR SEQUENCE)
  16. pHom Function
  17. From the O’Rawe et al. generated results. FBVC = frequency-based variant caller (Lyons-Weiler et al.)
      Caller     Signal     Concordance w/ FBVC   Hom     Het
      gatk       ALL        0.5762                11868   17670
      gatk       pw<=0.05   0.9976                11282    5676
      gatk       pw>0.05    0.0074                  586   11994
      samtools   ALL        0.5649                11541   18799
      samtools   pw<=0.05   0.9917                11489    5761
      samtools   pw>0.05    0.0002                   52   13038
      snver      ALL        0.6006                11904   16729
      snver      pw<=0.05   0.9934                11812    5470
      snver      pw>0.05    0.0007                   92   11259
  18. FBVC_vs_FBVC: %Concordance by Tx and Signal
      Tx                                          Signal     %Concordance
      Marked                                      ALL        85.64
      Marked                                      pw<=0.05   91.08
      Marked                                      pw>0.05    35.66
      Realigned                                   ALL        83.82
      Realigned                                   pw<=0.05   91.69
      Realigned                                   pw>0.05    28.21
      Recalibrated                                ALL        93.14
      Recalibrated                                pw<=0.05   ***99.39
      Recalibrated                                pw>0.05    48.53
      Reduced                                     ALL        21.54
      Reduced                                     pw<=0.05   24.57
      Reduced                                     pw>0.05    4.25
      Marked-Realigned                            ALL        76.91
      Marked-Realigned                            pw<=0.05   86.11
      Marked-Realigned                            pw>0.05    15.44
      Marked-Realigned-Recalibrated               ALL        76.73
      Marked-Realigned-Recalibrated               pw<=0.05   85.99
      Marked-Realigned-Recalibrated               pw>0.05    15.34
      Marked-Realigned-Recalibrated-Reduced       ALL        19.98
      Marked-Realigned-Recalibrated-Reduced       pw<=0.05   22.9
      Marked-Realigned-Recalibrated-Reduced       pw>0.05    2.66
  19. Diagram labels (repeat of slide 7): Information Theory; Consensus Analysis (e.g., 2/3, 3/4, set analysis; -> modeling); Improve Callers (fix errors, modeling); Bake Offs; LOW CONCORDANCE (O’Rawe et al., 2013); VARIANT CALLERS; MAPPER; SEQUENCER; TRUTH (BIOLOGICAL MOLECULAR SEQUENCE); Simulations; Spiked Ins
  20. Lifescope reads (red); Shrimp2 reads (blue). Mappers must be systematically evaluated
  21. Part 2: Good and Bad News for RNASeq (and everything else). The Bad News: Fold Change is Biased. The Good News: We have identified a much less biased method.
  22. T-test is not appropriate for small N, large P data (such as RNASeq)
  23. Fold Change > 2.0; Delta > 25
  24. FC(A/B) is Blind to Large Portions of Your Data. Panels: FC(A/B); Delta (and J5: Patel & Lyons-Weiler, 2004)
  25. Ratios Are Hard to Interpret as Biological Differences
      Gene    A       B       delta (A-B)   FC(A/B)
      gene1   5       3       2             1.667
      gene2   50      30      20            1.667
      gene3   500     300     200           1.667
      gene4   5000    3000    2000          1.667
      gene5   50000   30000   20000         1.667
  26. A-B is a difference. A/B is a quotient.
  27. Log2 Transformation Does not Help. Reveals Minor Delta (& J5) Bias. Pink = FC(A/B); Black = Delta
  28. G-Thresholding J5
  29. FC Bias in Amyotrophic Lateral Sclerosis. Scatter plot with axes Control and ALS; series DEGy and FCDEGy. Black circles = FC(A/B). Pink = Gthr-J5 genes
  30. FC(A/B) Bias in Alcohol-Induced Hepatitis. Black circles = FC(A/B). Pink = Gthr-J5 genes
  31. Conclusions
      • Not all NGS/HTS sites have sufficient genotypic signal to warrant a base call. High coverage alone does not provide a solution.
      • By measuring genotypic signal, we can determine which sites we can call with confidence.
      • Fold change (FC(A/B)) is blind to highly expressed genes and should be abandoned as a measure of differential expression altogether, even for single-gene or single-protein studies!
      • Published microarray data sets analyzed to date using FC(A/B) only are a gold mine for re-analysis using less biased methods.
  32. Credits and Contact
      • pw, pHom, etc.: James Lyons-Weiler, Alan Twaddle, Rahil Sethi.
        – (MS in preparation)
        – Our software is called Gconf (not yet available)
      • Fold-Change Bias: James Lyons-Weiler, Tamanna Sultana, Rick Jordan, Rahil Sethi
        – (Paper in review)
        – For now, read:
          • Mariani TJ, Budhraja V, Mecham BH, Gu CC, Watson MA, Sadovsky Y. 2003. A variable fold change threshold determines significance for expression microarrays. FASEB J. 17:321-3. doi: 10.1096/fj.02-0351fje
          • Pearson K. 1897. On a form of spurious correlation that may arise when indices are used for the measurement of organs. Proc Roy Soc Lond 60:489-498. doi: 10.1098/rspl.1896.0076
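
The slides on Boltzmann entropy (slide 9), the w table (slide 12) and the equiprobable null (slide 13) do not spell out how w and pw are computed. The sketch below is one possible reading, not the authors' Gconf implementation: it assumes w is the exponential of the Shannon entropy (in nats) of the four base frequencies, and it assumes pw can be approximated by Monte Carlo sampling from the equiprobable null at the same read depth. The function names `w_from_counts` and `pw_monte_carlo`, the simulation count and the seed are all hypothetical. Under this assumption the three exact rows of slide 12 (w = 1, 2 and 4) are reproduced; the intermediate rows come out close to, but not identical with, the tabulated values, so the authors' exact formula may differ.

```python
import math
import random


def w_from_counts(counts):
    """Effective number of bases at a site: w = exp(H).

    ASSUMPTION: H is the Shannon entropy (nats) of the base frequencies;
    this is one reading of w = antiln(s/k), not a formula given in the talk.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads at this site")
    entropy = 0.0
    for n in counts:
        if n > 0:  # a base with zero reads contributes nothing
            p = n / total
            entropy -= p * math.log(p)
    return math.exp(entropy)


def pw_monte_carlo(counts, n_sim=2000, seed=0):
    """Approximate P(w_null <= w_observed) under f(A) = f(T) = f(G) = f(C).

    ASSUMPTION: a Monte Carlo stand-in for pw; the talk does not say how
    pw is obtained. Uses the add-one estimator so the result is never 0.
    """
    rng = random.Random(seed)
    depth = sum(counts)
    observed = w_from_counts(counts)
    hits = 0
    for _ in range(n_sim):
        draws = rng.choices(range(4), k=depth)  # equiprobable bases
        null_counts = [draws.count(base) for base in range(4)]
        if w_from_counts(null_counts) <= observed + 1e-12:
            hits += 1
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    # Base counts in A, T, C, G order, taken from slide 12.
    for site in ([200, 0, 0, 0], [100, 100, 0, 0], [50, 50, 50, 50]):
        print(site, round(w_from_counts(site), 6), pw_monte_carlo(site, n_sim=500))
```

A low pw means the observed base distribution is far more concentrated than noise would produce, which is the sense in which the talk treats pw <= 0.05 sites as carrying genotypic signal.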
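
Slides 17 and 18 report concordance separately for all sites, for sites with pw <= 0.05 and for sites with pw > 0.05. A minimal sketch of that stratification follows. It assumes each site is available as a pair of genotype calls plus a pw value; the function name `concordance_by_signal` and the five toy sites are hypothetical and are not data from the talk.

```python
def concordance_by_signal(sites, alpha=0.05):
    """Fraction of sites where two callers agree, overall and split by pw.

    sites: iterable of (call_a, call_b, pw) tuples.
    Returns a dict with keys "ALL", "signal" (pw <= alpha) and "no_signal"
    (pw > alpha); a stratum with no sites maps to None.
    """
    strata = {"ALL": [], "signal": [], "no_signal": []}
    for call_a, call_b, pw in sites:
        agree = call_a == call_b
        strata["ALL"].append(agree)
        strata["signal" if pw <= alpha else "no_signal"].append(agree)
    return {
        name: (sum(flags) / len(flags) if flags else None)
        for name, flags in strata.items()
    }


# Toy input (hypothetical): genotype from caller A, genotype from caller B, pw.
toy_sites = [
    ("AA", "AA", 0.0),
    ("AT", "AT", 0.01),
    ("AG", "AA", 0.40),
    ("CG", "CC", 0.20),
    ("TT", "TT", 0.60),
]

if __name__ == "__main__":
    print(concordance_by_signal(toy_sites))
```

Reporting the three strata side by side, as the slides do, makes it visible when disagreement between callers is concentrated in the sites that fail the signal filter.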
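
The table on slide 25 can be checked directly. The sketch below recomputes delta (A-B) and FC(A/B) for its five genes and then applies the two cutoffs shown on slide 23 (Fold Change > 2.0, Delta > 25). Pairing those cutoffs with this particular table is an illustration chosen here, not something the slides do, and the variable names are hypothetical.

```python
# Expression values (A, B) copied from the slide 25 table.
GENES = {
    "gene1": (5, 3),
    "gene2": (50, 30),
    "gene3": (500, 300),
    "gene4": (5000, 3000),
    "gene5": (50000, 30000),
}

FC_CUTOFF = 2.0      # from slide 23
DELTA_CUTOFF = 25    # from slide 23

# For each gene: difference A - B and quotient A / B.
rows = {
    name: {"delta": a - b, "fc": a / b}
    for name, (a, b) in GENES.items()
}

passes_fc = [name for name, r in rows.items() if r["fc"] > FC_CUTOFF]
passes_delta = [name for name, r in rows.items() if r["delta"] > DELTA_CUTOFF]

if __name__ == "__main__":
    for name, r in rows.items():
        print(f"{name}: delta={r['delta']}, FC={r['fc']:.3f}")
    print("pass FC cutoff:", passes_fc)
    print("pass delta cutoff:", passes_delta)
```

Every gene has the same fold change of about 1.667, so none passes the fold-change cutoff, while the deltas span four orders of magnitude and gene3 through gene5 pass the delta cutoff. This is the sense in which the talk calls FC(A/B) blind to highly expressed genes.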