On the analysis of relative abundances in ecogenomics - David Lovell

In ecogenomics and other areas of molecular systems biology, many measurement methods yield only the relative abundance of different molecular components. Treating such data as though they carried information about absolute abundance can be very misleading. With relative abundances, differential expression needs careful interpretation, and correlation—a statistical workhorse for analyzing pairwise relationships—is an inappropriate measure of association. Using data on absolute and relative gene expression in yeast, we show why relative abundances need special analysis and interpretation, and present principles for doing so based on the theory of compositional data analysis. We show how correlation of data that carry only relative information can lead to conclusions opposite to those drawn from absolute abundances. We present a well-principled alternative—proportionality—and show how it can be used as the basis of analyses familiar in molecular bioscience. We also talk about some of the challenges of applying this approach to ecogenomic and similar count data that are rich in zeros and low counts.

Published in: Science

  • “We typically request that users provide us with 11 μg of total RNA/sample (at a concentration of 1.25 μg/uL)” is quoted from http://dmaf.biochem.uci.edu/content/affymetrix-guidelines-preparation
    “This protocol is optimized for 0.1–4 μg of total RNA…” is quoted from the TruSeq™ RNA Sample Preparation Guide
  • On the analysis of relative abundances in ecogenomics - David Lovell

    1. On the analysis of relative abundances… in ecogenomics. David Lovell, Vera Pawlowsky-Glahn and Juan José Egozcue. Ecogenomics from Data to Knowledge | Canberra | 14 February 2014. CSIRO COMPUTATIONAL INFORMATICS
    2. Warning! There will be a little maths… not a lot… but perhaps “too much” if you don’t like maths
        • Logarithms
        • Variances
        • Slope of a regression line
        • Goodness-of-fit of a regression line
        This presentation is meant for everyone: stop me if I’ve lost you!
    3. Logarithms… turn multiplication into addition of exponents
    4. Variance… measures how far a set of numbers is spread out (figure labels: less variance, more variance)
    5. Slope and goodness of fit… What line best fits the points? And how close are the points to it?
    6. Motivation: Want to draw sound conclusions from our observations
        • …conclusions that are about the Universe
        • …not artefacts of the way we have observed it
    7. Motivation – more specific: A lot of bioscience data carry only relative information
        How can you tell if you’re measuring relative information?
        • “Would doubling the amount of starting material double my measurements?”
        Different steps in measurement can render data relative
        • Sample preparation: obtaining a fixed mass and concentration of nucleic acid for further analysis
            – “We typically request that users provide us with 11 μg of total RNA/sample (at a concentration of 1.25 μg/uL)”
            – “This protocol is optimized for 0.1–4 μg of total RNA…”
        • Measurement platform
            – “How many reads in a typical DNA/RNA-seq experiment? A typical read count… may have hundreds of millions” http://www.vlsci.org.au/lscc/rna-seq
            – A large, but finite amount… what would doubling the starting material do?
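The doubling question on slide 7 can be sketched in a few lines. This is a minimal illustration rather than code from the talk: the abundance values and the `closure` helper name are hypothetical.

```python
import numpy as np

def closure(abundances):
    """Rescale non-negative abundances so they sum to 1 (relative abundances)."""
    abundances = np.asarray(abundances, dtype=float)
    return abundances / abundances.sum()

# Hypothetical absolute abundances of four components in one sample.
absolute = np.array([120.0, 30.0, 45.0, 5.0])
doubled = 2 * absolute  # twice the starting material

relative = closure(absolute)
relative_doubled = closure(doubled)

# Both lines print the same proportions: the doubling is invisible.
print(relative)
print(relative_doubled)
```

If a measurement behaves like `closure`, doubling the starting material leaves it unchanged, which is the sign that it carries only relative information.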
    8. Motivation – more specific: A lot of bioscience data carry only relative information
        Question to the audience:
        • How often is absolute microbial abundance observed or estimated?
            – Microbes per gram of soil?
            – Microbes per litre of seawater?
    9. A lot of bioscience data are relative. So what?
        Relative data need to be analysed and interpreted differently to absolute data
        • Forget that, and you can easily draw the wrong conclusions from your data
    10. The focus of this talk: correlation. A workhorse of bioinformatics and quantitative bioscience
        I will
        • Show you why correlation is not appropriate for relative data
        • Give you an alternative measure of association that can be used confidently
    11. Correlation for relative data? Three nails in its coffin
        1. Spurious correlation
        2. Geometry of relative data
        3. Real-life examples of yeasty goodness from the Bähler Lab
    12. Nail #1: Spurious correlation. A recipe for disaster
        Ingredients
        • Three (3) statistically independent variables x, y and z
            – Note that the correlation between x and y should be about zero
        Method
        • Form the ratios x/z and y/z
        • Plot the ratios against each other
        • Watch correlation magically appear
        Serves…
        • This recipe has served entire research disciplines for some years, despite Karl Pearson’s warning in 1896
        For a demonstration…
        • See http://en.wikipedia.org/wiki/Spurious_correlation
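The recipe on slide 12 is easy to try. The sketch below is an illustration under stated assumptions, not the demonstration used in the talk: the lognormal distribution, its parameters, the sample size and the seed are all arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Ingredients: three statistically independent, strictly positive variables.
x = rng.lognormal(mean=0.0, sigma=0.3, size=n)
y = rng.lognormal(mean=0.0, sigma=0.3, size=n)
z = rng.lognormal(mean=0.0, sigma=0.3, size=n)

# Method: form the ratios x/z and y/z and correlate them.
r_xy = np.corrcoef(x, y)[0, 1]
r_ratios = np.corrcoef(x / z, y / z)[0, 1]

print(f"corr(x, y)     = {r_xy:.3f}")      # close to zero
print(f"corr(x/z, y/z) = {r_ratios:.3f}")  # clearly positive
```

With equal spread in all three variables the ratio correlation comes out at roughly 0.5, even though x and y are unrelated; the shared divisor z alone produces it.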
    13–15. Spurious correlation in action: Cooking the books (figure slides)
    16. Nail #2: Geometry of relative data
        Imagine we’ve measured the relative abundance of some microbes
        • We have made measurements in seven different samples
        • Let’s focus in on the relationship between microbe1 and microbe2
    17–25. Figure slides giving a pair of relative abundances for each sample (values in the order shown on the slides)
        Sample #1: 40%, 30%
        Sample #2: 37.5%, 35%
        Sample #3: 35%, 40%
        Sample #4: 32.5%, 45%
        Sample #5: 30%, 50%
        Sample #6: 27.5%, 55%
        Sample #7: 25%, 60%
    26. What do these pairs of relative abundances tell us about the relationship between the absolute numbers of microbe1 and microbe2 in the seven environments sampled?
    27. If the numbers of microbe1 increase, what are the numbers of microbe2 likely to do?
    28. Hint: each pair of relative values comes from a pair of absolute values that lie on the ray from the origin
    33–36. Correlations between relative values tell us nothing about the relationships between the absolute values that gave rise to them …unless total abundance is constant OR the pairs of relative values are proportional
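The geometric argument can be checked numerically with the seven pairs of relative abundances from the sample slides. This is a sketch under assumptions: which series belongs to microbe1 and which to microbe2 is a guess, and the total abundances are hypothetical values chosen here to place each absolute pair further out along its ray.

```python
import numpy as np

# Relative abundances (%) from the seven samples.
# Assumption: the first series is microbe1 and the second is microbe2.
rel1 = np.array([40.0, 37.5, 35.0, 32.5, 30.0, 27.5, 25.0])
rel2 = np.array([30.0, 35.0, 40.0, 45.0, 50.0, 55.0, 60.0])

# Hypothetical total abundances, one per sample (not from the talk).
totals = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0])

# Each absolute pair lies on the ray through the origin and its relative pair.
abs1 = rel1 / 100 * totals
abs2 = rel2 / 100 * totals

r_relative = np.corrcoef(rel1, rel2)[0, 1]
r_absolute = np.corrcoef(abs1, abs2)[0, 1]

print(f"correlation of relative values: {r_relative:.2f}")
print(f"correlation of absolute values: {r_absolute:.2f}")
```

The relative values are perfectly negatively correlated, yet with these totals both microbes increase together in absolute terms. Other choices of totals would give other answers, which is why the relative correlation on its own says nothing about the absolute one.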
    37. Proportionality: A meaningful measure of association for relative data. $\frac{x}{t} \propto \frac{y}{t} \Rightarrow x \propto y$
    43–44. When pairs of relative values stay in proportion across samples so do their corresponding absolute values …when we work with relative values we lose the ability to infer correlations, but we can still infer proportionalities
    45. Q: How do we measure proportionality? A: Use a log-log plot (panels: natural scale; log-log scale). Pairs of values that behave proportionally fit a line of slope 1 (i.e., 45°) on a log-log scale
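A short sketch of the slope-1 property, using made-up data: `y` is set to exactly three times `x`, and the per-sample totals `t` are arbitrary. An ordinary least-squares fit via `np.polyfit` is used here purely for convenience; the talk does not say which fitting method produced its plots.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical absolute abundances: y is proportional to x across 16 samples.
x = rng.lognormal(mean=2.0, sigma=1.0, size=16)
y = 3.0 * x

# Straight-line fit on the log-log scale.
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
angle = np.degrees(np.arctan(slope))

# Dividing both by arbitrary per-sample totals keeps them in proportion.
t = rng.lognormal(mean=5.0, sigma=1.0, size=16)
slope_rel, _ = np.polyfit(np.log(x / t), np.log(y / t), deg=1)

print(f"slope = {slope:.3f}, angle = {angle:.1f} degrees")
print(f"slope for relative values = {slope_rel:.3f}")
```

The intercept on the log-log scale is the logarithm of the proportionality constant, and the slope stays at 1 whether the values are absolute or relative.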
    46–48. Plot of slope in degrees against proportion of variance explained ($r^2$), with $\phi(\log x, \log y) = \operatorname{var}(\log(x/y)) / \operatorname{var}(\log x) = 1 + \beta^2 - 2\beta|r|$
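The formula on slide 48 is garbled in this transcript. Assuming it reads φ(log x, log y) = var(log(x/y)) / var(log x) = 1 + β² − 2β|r|, with β the standardised major axis slope of log y on log x and r their correlation, the two forms can be compared numerically. The function names and data below are hypothetical.

```python
import numpy as np

def phi(x, y):
    """var(log(x/y)) / var(log x) for strictly positive x and y."""
    log_x, log_y = np.log(x), np.log(y)
    return np.var(log_x - log_y) / np.var(log_x)

def phi_from_slope_and_r(x, y):
    """Same quantity via 1 + beta^2 - 2*beta*|r| (beta: signed SMA slope)."""
    log_x, log_y = np.log(x), np.log(y)
    r = np.corrcoef(log_x, log_y)[0, 1]
    beta = np.sign(r) * np.std(log_y) / np.std(log_x)
    return 1 + beta**2 - 2 * beta * abs(r)

rng = np.random.default_rng(2)
x = rng.lognormal(size=16)
y = x * rng.lognormal(sigma=0.2, size=16)  # roughly proportional to x

phi_direct = phi(x, y)
phi_identity = phi_from_slope_and_r(x, y)
phi_exact = phi(x, 3.0 * x)  # exactly proportional pair

print(phi_direct, phi_identity, phi_exact)
```

Under this reading, φ is zero for an exactly proportional pair and grows as the pair departs from a slope of 1 or from a perfect fit.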
    49. Nail #3: Real-life examples. Yeasty goodness from the Bähler Lab
        The data:
        • Expression levels of over 3000 yeast genes over a 16-point time course
        • …ok, so this is not ecogenomic data
            – But imagine this is about the abundance of 3000 OTUs in 16 different samples
        • Total abundance is anything but constant
    56. Plot of slope in degrees against proportion of variance explained ($r^2$)
    57. The minimum of $\phi(\log x, \log y)$ is at $r^2 = 1$ and slope 1, i.e., 45°
    58. How does this look for yeast? …let’s plot the slope and $r^2$ for 4.5 million pairs of mRNAs… yay! Let’s not. Instead, let’s bin the values of slope and $r^2$ for all 3031 × 3030 / 2 pairs of mRNA and show the counts on a heatmap
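The binning step can be sketched as follows. Everything specific here is an assumption made for illustration: the data are random stand-ins rather than the yeast measurements, the slope is taken as the magnitude of the standardised major axis slope, and the bin counts and ranges are arbitrary.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(3)
n_genes, n_samples = 50, 16  # small stand-in for 3031 mRNAs x 16 time points
log_expr = rng.normal(size=(n_genes, n_samples))  # hypothetical log abundances

slopes_deg, r_squared = [], []
for i, j in combinations(range(n_genes), 2):
    r = np.corrcoef(log_expr[i], log_expr[j])[0, 1]
    # Magnitude of the standardised major axis slope, expressed as an angle.
    beta = np.std(log_expr[j]) / np.std(log_expr[i])
    slopes_deg.append(np.degrees(np.arctan(beta)))
    r_squared.append(r**2)

# Count pairs per (slope, r^2) bin; these counts are what a heatmap would show.
counts, slope_edges, r2_edges = np.histogram2d(
    slopes_deg, r_squared, bins=[18, 20], range=[[0, 90], [0, 1]]
)

n_pairs = n_genes * (n_genes - 1) // 2
print(f"{n_pairs} pairs binned into a {counts.shape} grid")
print(f"pairs for 3031 mRNAs: {3031 * 3030 // 2}")
```

For the full data set the same loop would cover about 4.6 million pairs, so a vectorised calculation would be preferable to this pairwise loop.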
    59. 0.025 and 0.05 contours of $\phi(\log x, \log y)$
    66. Challenges with ecogenomic data: Can we make sense of microbial relative abundance?
        We have been working with Human Microbiome Data this summer
        • Dang! There’s a lot of zeros
            – Zeros… the natural enemy of the logarithm
        • Also there are a lot of small counts: 1, 2, 3…
            – Counts (integers ≥ 0) do not carry only relative information
            – Expected counts do (reals ≥ 0)
        • Yes there are measures of association or dissimilarity that “handle” these data
            – Bray-Curtis, UniFrac…
            – But what reliable inferences can you make when OTUs play hide and seek?
        Also this summer
        • I have become deeply suspicious of the practical value of null hypothesis significance testing (p-values)
            – Esp., “statistically significant correlation coefficients”
    67. Thanks and acknowledgements
        • Jürg Bähler and Sam Marguerat for data and wisdom on yeast
        • The Spanish Government for supporting VP-G and JJE’s collaboration with DL
        • Jack Simpson and Lauren Bragg for helping to make sense of microbes this summer
        • Karoline Faust and co-authors for graciously providing the data matrices from
            – Faust, Karoline, J. Fah Sathirapongsasuti, Jacques Izard, Nicola Segata, Dirk Gevers, Jeroen Raes, and Curtis Huttenhower. 2012. “Microbial Co-Occurrence Relationships in the Human Microbiome.” PLoS Comput Biol 8 (7).
        • David Warton and co-authors for an excellent paper on Standardised Major Axis fitting
            – Warton, David I., Ian J. Wright, Daniel S. Falster, and Mark Westoby. 2006. “Bivariate Line-fitting Methods for Allometry.” Biological Reviews 81 (2): 259–291.
        • The R Core Team and developers of R packages including
            – Daniel Adler, Duncan Murdoch, Yihui Xie, Hadley Wickham, Gerald van den Boogaart and Raimon Tolosano-Delgado
        • Rachel, Felix and Ava (my wife and kids) for putting up with me on 5 hours sleep for weeks on end
    68. Thank you
        CSIRO Transformational Biology
        David Lovell, Bioinformatics and Analytics Leader
        t +61 2 6216 7042
        e David.Lovell@csiro.au
        w www.csiro.au/people/David.Lovell
        CSIRO MATHEMATICS, INFORMATICS AND STATISTICS
