Statistics on big biomedical data
Methods and pitfalls when analyzing high-throughput screens
Lars Juhl Jensen
t-test
ANOVA
normal distribution
useful tests
counts
contingency table
Jensen et al., Nature Reviews Genetics, 2012
Fisher’s exact test
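As a minimal sketch of how a 2x2 contingency table of counts turns into a p-value, the one-sided Fisher's exact test can be written directly from the hypergeometric distribution. The function name and the example counts are hypothetical and not taken from the talk; in practice a library routine such as `scipy.stats.fisher_exact` would normally be used instead.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability, with all margins held fixed, of observing
    at least `a` counts in the top-left cell (enrichment).
    """
    row1 = a + b              # e.g. number of hits in the screen
    col1 = a + c              # e.g. number of annotated genes
    n = a + b + c + d         # all genes tested
    total = comb(n, col1)     # ways to place the annotated genes
    upper = min(row1, col1)   # largest possible top-left count
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / total

# Hypothetical counts: 12 of 40 hits are annotated, 50 of 960 non-hits are.
print(fisher_exact_greater(12, 28, 50, 910))
```

The test is exact because it enumerates every table compatible with the margins, which is why it remains usable for the small counts where a chi-squared approximation breaks down.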
real numbers
no theoretical distribution
non-parametric statistics
do the medians differ?
Mann–Whitney U test
medians can mislead you
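The Mann–Whitney U test only uses the ordering of the values, which is what makes it non-parametric. The sketch below is an illustration under simplifying assumptions: it counts pairs directly (quadratic time), uses the normal approximation for the p-value, and applies no tie correction. The function name is hypothetical.

```python
from statistics import NormalDist

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Assumes reasonably large samples and few ties; no tie correction.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs (xi, yj) with xi > yj; a tie counts as one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, p

# Hypothetical measurements from two conditions.
u, p = mann_whitney_u([1.2, 3.4, 2.2, 5.1], [4.8, 6.0, 7.3, 5.9])
print(u, p)
```

Because U is built from pairwise comparisons, the test is really about whether values from one sample tend to be larger than values from the other, which is related to but not identical to a difference in medians.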
do the distributions differ?
Kolmogorov–Smirnov test
does not tell how they differ
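The Kolmogorov–Smirnov statistic is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic, with a hypothetical function name; turning it into a p-value is left to a library or to resampling.

```python
from bisect import bisect_right

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the maximum absolute difference between the two empirical CDFs;
    it is enough to evaluate both CDFs at the observed values.
    """
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # fraction of x values <= v
        fy = bisect_right(ys, v) / len(ys)  # fraction of y values <= v
        d = max(d, abs(fx - fy))
    return d

# Hypothetical samples.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A single number D summarizes the gap wherever it happens to be largest, so a shift in location, a change in spread, and a change in shape can all give the same D.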
resampling
Monte Carlo testing
always applicable
compute intensive
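One common form of resampling is a permutation test: shuffle the group labels many times and ask how often the shuffled data look at least as extreme as the real data. The sketch below uses the absolute difference in means as the test statistic; that choice, the function name, and the default number of resamples are assumptions made for illustration.

```python
import random
from statistics import fmean

def permutation_test(x, y, n_resamples=10_000, seed=0):
    """Monte Carlo permutation test for a difference in means (two-sided)."""
    rng = random.Random(seed)
    observed = abs(fmean(x) - fmean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(fmean(pooled[:n_x]) - fmean(pooled[n_x:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_resamples + 1)

# Hypothetical measurements from two conditions.
print(permutation_test([5.1, 4.9, 6.2, 5.8], [6.9, 7.4, 6.5, 7.1]))
```

Any statistic can be swapped in for the difference in means, which is where the flexibility comes from. The cost is the loop: the smallest p-value that can be reported is about 1 / `n_resamples`, so small p-values require many resamples.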
multiple testing
xkcd.com
compare multiple conditions
Gene Ontology enrichment
Bonferroni
avoid making any errors
too conservative
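The Bonferroni correction multiplies each p-value by the number of tests, capped at 1. A minimal sketch, with a hypothetical function name and made-up p-values:

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# Hypothetical p-values from three tests.
print(bonferroni([0.01, 0.04, 0.5]))
```

With thousands of tests, as in a Gene Ontology enrichment analysis, the multiplier becomes so large that only extremely small raw p-values survive, which is the sense in which the correction is conservative.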
Benjamini–Hochberg
control false discovery rate
assumes independence
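The Benjamini–Hochberg procedure ranks the p-values and scales each one by the number of tests divided by its rank. The sketch below returns adjusted p-values (often called q-values); the function name and the example values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values; call a test significant if its adjusted value <= 0.05.
print(benjamini_hochberg([0.9, 0.001, 0.5]))
```

Calling everything with an adjusted value at or below 0.05 significant means accepting that roughly 5% of those calls are expected to be false discoveries, rather than trying to avoid every single error.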
resampling
negative set
systematic biases
Huang et al., Journal of Proteome Research, 2014
studiedness bias
we study disease proteins
thus we know many PTMs
abundance bias
higher expressed
easier to detect in assays
better characterized
matched background
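One way to build a matched background is to draw, for every protein of interest, a random other protein with a similar value of the confounding variable. The sketch below bins on a single covariate such as log abundance; the function name, the binning scheme, and the toy data are all assumptions made for illustration and not the method used in the cited work.

```python
import math
import random
from collections import defaultdict

def matched_background(foreground, candidates, covariate, bin_width=1.0, seed=0):
    """Pick one background item per foreground item from the same covariate bin.

    `covariate` maps each item to a number (e.g. log10 abundance).
    Foreground items whose bin has no candidates left are skipped.
    """
    rng = random.Random(seed)
    fg = set(foreground)
    pools = defaultdict(list)
    for item in candidates:
        if item not in fg:
            pools[math.floor(covariate[item] / bin_width)].append(item)
    for pool in pools.values():
        rng.shuffle(pool)
    background = []
    for item in foreground:
        pool = pools[math.floor(covariate[item] / bin_width)]
        if pool:
            background.append(pool.pop())  # sample without replacement
    return background

# Toy data: hypothetical proteins with hypothetical log10 abundances.
abundance = {"a": 0.2, "b": 0.5, "c": 0.7, "d": 3.1, "e": 3.4, "f": 3.9}
print(matched_background(["a", "d"], list(abundance), abundance))
```

Comparing the foreground against this background instead of against all proteins removes the part of the signal that is explained by the covariate alone.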
the big data effect
if you have enough data
any difference is significant
but maybe not relevant
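The effect is easy to reproduce numerically: hold a tiny difference fixed and let the sample size grow. The sketch below uses a two-sample z-test with a known standard deviation, which is a simplifying assumption; the function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def z_test_p(mean_diff, sd, n):
    """Two-sided p-value for a mean difference between two groups of size n.

    Assumes both groups share a known standard deviation `sd`.
    """
    se = sd * (2 / n) ** 0.5  # standard error of the difference in means
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same small difference (0.02 standard deviations) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, z_test_p(mean_diff=0.02, sd=1.0, n=n))
```

The difference itself never changes, only the standard error shrinks, so the p-value says nothing about whether a shift of 0.02 standard deviations matters.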
“significant”
statistical significance
p-value
biological relevance
fold change
relative risk
significant and relevant
volcano plots
Lundby et al., Science Signaling, 2013
rather ad hoc
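A volcano plot puts the log2 fold change on the x-axis and the negative log10 p-value on the y-axis, so points that are both significant and relevant end up in the upper corners. The sketch below computes the coordinates and applies two cutoffs; the function name, the cutoff values, and the example data are hypothetical, and the choice of cutoffs is exactly the ad hoc part.

```python
import math

def volcano_points(results, p_cutoff=0.05, fc_cutoff=2.0):
    """Compute volcano-plot coordinates and flag points passing both cutoffs.

    `results` is an iterable of (name, fold_change, p_value) with
    fold_change > 0 and p_value > 0.
    """
    min_log_fc = math.log2(fc_cutoff)
    points = []
    for name, fold_change, p in results:
        x = math.log2(fold_change)  # relevance: effect size
        y = -math.log10(p)          # significance
        keep = p <= p_cutoff and abs(x) >= min_log_fc
        points.append((name, x, y, keep))
    return points

# Hypothetical results: only the first entry is both significant and relevant.
for point in volcano_points([("A", 4.0, 0.001), ("B", 1.1, 1e-10), ("C", 8.0, 0.5)]):
    print(point)
```

In an analysis with many tests, the p-values passed in would typically be adjusted for multiple testing first.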
questions?