Multivariate data analysis and visualization tools for biological data

Published in: Education, Technology

  1. Multivariate Data Analysis and Visualization Tools for Understanding Biological Data. Dmitry Grapov
  2. Introduction: Systems. Figure labels: Emergent; Reductionist; Deterministic; Systems; Complex systems; Chemical analysis; Physiology; Biochemistry; Graph theory; Modeling; Informatics. Source: Oltvai, et al. Science 25 October 2002: 763-764.
  3. Introduction: Inference
  4. Overview (image source: http://www.thefullwiki.org/Hypercube)
     - Univariate (1-D). Properties: vector. Representations: histograms, densities. Central idea: mean.
     - Bivariate (2-D). Properties: matrix. Representations: scatter plots. Central idea: correlation.
     - Multivariate (n-D). Properties: matrix. Representations: dendrograms, heatmaps, biplots, networks. Central idea: many.
  5. Univariate: Properties
     - vector of length m
       - mean
       - variance
  6. Univariate: Representations
  7. Univariate: Assumptions
     - Normality
  8. Univariate: Utility
     - Hypothesis testing
       - α: type I error (false positive)
       - β: type II error (false negative)
       - power: (1 – β)
       - effect size: standardized difference in mean
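
The quantities on the hypothesis-testing slide can be related with a small sketch. It is not taken from the deck: the function names `cohens_d` and `approx_power`, the toy measurements, and the normal approximation to the two-sample test are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Effect size: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample test.

    Uses a normal approximation, so it is only a rough guide for small samples.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # rejection threshold set by alpha
    shift = abs(d) * sqrt(n_per_group / 2)  # expected test statistic under the alternative
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical measurements for two groups (invented for this example).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [4.2, 4.8, 4.5, 5.0, 4.4, 4.7]

d = cohens_d(group_a, group_b)
power = approx_power(d, n_per_group=len(group_a))
beta = 1 - power  # type II error rate
print(f"effect size d = {d:.2f}, power = {power:.2f}, beta = {beta:.2f}")
```

With an effect size of zero the function returns α itself, and for a fixed effect size the power rises with the number of samples per group.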
  9. Univariate: Limitations
     - Biological definition of the mean?
     - Relationship between sample size and test power
     - Multiple hypothesis testing
       - False discovery rate
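
One common way to control the false discovery rate when many hypotheses are tested is the Benjamini-Hochberg step-up procedure. The deck does not say which method was used, so the sketch below is an assumption; the function name `benjamini_hochberg` and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k whose p-value is <= (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from ten univariate tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, fdr=0.05)

uncorrected = sum(p < 0.05 for p in p_values)
print(f"significant without correction: {uncorrected}, after FDR control: {sum(rejected)}")
```

In this toy set five tests pass an uncorrected 0.05 threshold, but only the two smallest p-values survive the correction.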
  10. Old Faithful Data
     - 272 observations
     - time between eruptions
       - 70 ± 14 min
     - duration of eruption
       - 3.5 ± 1 min
     - Source: Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365
  11. Bivariate: Properties
     - Matrix of 2 vectors of length m
  12. Bivariate: Representations (X, Y)
  13. Bivariate: Utility (X, Y)
     - bivariate distribution
     - correlation
     - Variable 2 = m * Variable 1 + b
  14. Bivariate: Limitations (image source: http://en.wikipedia.org/wiki/Correlation)
     - correlation coefficient
       - Measure of linear or monotonic relationship
  15. Bivariate: Limitations (image source: http://en.wikipedia.org/wiki/Correlation)
     - Sensitive to outliers
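
The outlier sensitivity noted on the two limitation slides can be seen on a toy data set. This is a sketch with invented numbers, not data from the deck. `pearson` measures linear association and `spearman` measures monotonic association by correlating ranks; both helper names are chosen here for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (linear association)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks from 1 to n; assumes no ties, which holds for the toy data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (monotonic association)."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]  # roughly y = 2x

# Append a single hypothetical outlier.
x_out = x + [9]
y_out = y + [-40.0]

r_clean = pearson(x, y)
r_outlier = pearson(x_out, y_out)
rho_outlier = spearman(x_out, y_out)
print(f"Pearson clean: {r_clean:.3f}, with outlier: {r_outlier:.3f}, "
      f"Spearman with outlier: {rho_outlier:.3f}")
```

A single extreme point is enough to turn a near-perfect positive Pearson coefficient negative. The rank-based coefficient is also pulled down, but it stays positive because the outlier can only move one rank position to the extreme.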
  16. Old Faithful. Source: Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365
  17. Old Unfaithful?
  18. Old Unfaithful?
     - Additional variables
     - Nearby hydrofracking
     - Improve inference based on more information
  19. Old Unfaithful?
     - Additional variables
     - Nearby hydrofracking
     - Improve inference based on more information
  20. Multivariate: Properties
     - A matrix of n vectors of length m
     - Correlation matrix
     - Challenges
       - data often wide structured
       - integration
       - noise
     - Rewards
       - robust inference
       - signal amplification
       - holistic/systems approach
  21. Multivariate: Dimensional Reduction (figure axes: PC 1, PC 2)
     - Principal Components Analysis (PCA)
     - Linear n-dimensional encoding of original data
     - Where dimensions are:
       - orthogonal (uncorrelated)
       - Top k dimensions are ordered by variance explained
  22. Multivariate: Dimensional Reduction
     - Calculating PCs: singular value decomposition (SVD)
     - Figure labels: Original Data; Scores (m x PC); Explained variance (PC x PC); Loadings (n x PC)
     - Eigenvalue
       - explained variance
     - Scores
       - sample representation based on all variables
     - Loadings
       - variable contribution to scores
     - Source: Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis". In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
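
The decomposition into scores, loadings, and explained variance can be sketched with NumPy's SVD. The synthetic matrix below is an assumption standing in for real measurements, and the variable names are chosen only to mirror the slide's labels (m samples, n variables).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5  # m samples (rows), n variables (columns)

# Synthetic data: one hidden factor drives all variables, plus a little noise.
latent = rng.normal(size=(m, 1))
X = latent @ rng.normal(size=(1, n)) + 0.1 * rng.normal(size=(m, n))

# Center each variable, then decompose: Xc = U @ diag(s) @ Vt
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                  # m x PC: sample coordinates on each component
loadings = Vt.T                 # n x PC: variable contribution to each component
eigenvalues = s ** 2 / (m - 1)  # variance captured by each component
explained = eigenvalues / eigenvalues.sum()

print("fraction of variance explained per PC:", np.round(explained, 3))
```

Multiplying the scores by the transposed loadings reproduces the centered data, and `np.linalg.svd` returns singular values in descending order, so the components come out already sorted by variance explained.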
  23. Multivariate: Representations
     - A matrix of n vectors of length m
     - Old Faithful 2.0
       - 272 measurements
       - 8 variables
       - 2 real, 6 random noise
  24. Multivariate: Representation
     - Identify outliers using all measurements
     - Use known values to impute missing values
     - Identify interesting groups
     - Evaluate uni- and bivariate observations
     - Number of PCs can be used to assess true data complexity
  25. PCA: Considerations (figure panels: no pre-treatment; centered and scaled to unit variance)
     - data pre-treatment
     - outliers
     - noise
     - unsupervised projection
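
The effect of pre-treatment can be illustrated by comparing the first principal component before and after centering and scaling to unit variance. The three synthetic variables and the helper `first_pc_loadings` are assumptions made for this sketch; they are not the data shown on the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200

# Two correlated variables on a small scale, plus one unrelated variable on a large scale.
a = rng.normal(size=m)
X = np.column_stack([a, a + 0.3 * rng.normal(size=m), 1000.0 * rng.normal(size=m)])


def first_pc_loadings(data):
    """Loadings (unit vector) of the first principal component of column-centered data."""
    centered = data - data.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[0]


# No pre-treatment beyond centering: the large-scale variable dominates PC 1.
raw_pc1 = first_pc_loadings(X)

# Centered and scaled to unit variance: PC 1 follows the correlated pair instead.
autoscaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scaled_pc1 = first_pc_loadings(autoscaled)

print("PC 1 loadings, raw:       ", np.round(np.abs(raw_pc1), 3))
print("PC 1 loadings, autoscaled:", np.round(np.abs(scaled_pc1), 3))
```

Because PCA ranks directions by variance, the measurement units alone can decide which variable drives the first component unless the variables are put on a common scale.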
  26. PCA: Considerations
     - data pre-treatment
     - outliers
     - linear reconstruction
     - noise
       - Independent components analysis (ICA)
     - unsupervised projection
     - Use ICA to calculate statistically independent components
  27. PCA: Considerations
     - data pre-treatment
     - outliers
     - linear reconstruction
     - noise
     - supervised projection
       - Non-negative matrix factorization (NMF)
     - NMF uses additive parts-based encoding
     - Source: Learning the parts of objects by nonnegative matrix factorization, D.D. Lee, H.S. Seung; Zhipeng Zhao, ppt.
  28. PCA: Considerations
     - data pre-treatment
     - outliers
     - linear reconstruction
     - noise
     - supervised projection
       - Identify projection correlated with class assignment (classification) or continuous variables (regression)
       - Partial Least Squares Projection to Latent Structures (PLS/-DA)
  29. PLS/-DA: Utility
     - Strengths
       - Predict multiple dependent variables
       - Avoids issues of multicollinearity
       - Independent measure of variable importance
     - Weaknesses
       - Need to derive an empirical reference for model performance
       - Poorly established model optimization methods
  30. PLS-DA: Example
     - Data: Old Faithful 2.0
       - 272 observations on 8 variables
     - Latent Variables are analogous to PCs
     - Important Statistics (CV)
       - Q2 = fit
       - RMSEP = error of prediction
       - AU(RO)C = specificity vs. sensitivity
     - Select the appropriate number of Latent Variables (LVs) to maximize Q2
  31. PLS-DA: Performance
     - Use permutation tests to empirically determine model performance
  32. PLS-DA: Performance
     - Use permutation tests to empirically determine model performance
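
A permutation test builds an empirical reference for model performance by refitting the model on shuffled class labels. The sketch below is an assumption-laden stand-in: it uses a nearest-centroid classifier with leave-one-out accuracy instead of PLS-DA, and the synthetic two-class data and the function names `loo_accuracy` and `permutation_p_value` are invented for illustration.

```python
import random


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (stand-in for PLS-DA)."""
    correct = 0
    for i in range(len(X)):
        # Fit class centroids on every sample except sample i.
        centroids = {}
        for label in sorted(set(y)):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            if rows:
                centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        # Predict sample i as the class with the closest centroid.
        predicted = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(X[i], centroids[lab])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def permutation_p_value(X, y, n_perm=200, seed=0):
    """Compare observed performance with performance on randomly permuted labels."""
    rng = random.Random(seed)
    observed = loo_accuracy(X, y)
    null = []
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        null.append(loo_accuracy(X, shuffled))
    # Add 1 to numerator and denominator so the estimate is never exactly zero.
    p = (1 + sum(stat >= observed for stat in null)) / (n_perm + 1)
    return observed, null, p


# Synthetic two-class data with a real difference between the classes.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(20)] + \
    [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(20)]
y = [0] * 20 + [1] * 20

observed, null, p_value = permutation_p_value(X, y)
print(f"observed accuracy = {observed:.2f}, "
      f"mean permuted accuracy = {sum(null) / len(null):.2f}, p = {p_value:.3f}")
```

The same pattern applies to any performance statistic, such as Q2 or the area under the ROC curve: the distribution obtained from permuted labels shows what the model achieves when there is no real class structure.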
  33. PLS: Predictive Performance
     - Split data into training (2/3) and test sets (1/3)
     - Generate model using training set and then predict class assignment for test set
     - Use permutation tests to generate confidence bounds for future predictions
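
The training and test split described on this slide can be sketched as follows. As above, a nearest-centroid classifier stands in for the PLS model, and the data and the helper names `split_train_test`, `fit_centroids`, and `predict` are hypothetical.

```python
import random


def split_train_test(n_samples, test_fraction=1 / 3, seed=0):
    """Return (train_idx, test_idx) from a random shuffle of the sample indices."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]


def fit_centroids(X, y):
    """Fit the stand-in model: one mean vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


# Synthetic two-class data.
data_rng = random.Random(7)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)] + \
    [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(30)]
y = [0] * 30 + [1] * 30

# Train on 2/3 of the samples, then predict the held-out 1/3.
train_idx, test_idx = split_train_test(len(X))
model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
predictions = [predict(model, X[i]) for i in test_idx]
test_accuracy = sum(p == y[i] for p, i in zip(predictions, test_idx)) / len(test_idx)
print(f"train n = {len(train_idx)}, test n = {len(test_idx)}, "
      f"test accuracy = {test_accuracy:.2f}")
```

The test samples play no part in fitting, so the test accuracy estimates how the model would perform on new observations.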
  34. PLS: Predictive Performance
  35. PLS: Feature Selection
     - Use the PLS-DA as an objective function to identify the most informative variables
  36. Networks
     - Network: representation of relationships among objects
     - Utility
       - Project statistical results into a biological context
       - Explore informative data aspects in the context of all that was observed
       - Identify emergent patterns
  37. Networks
     - Interpret statistical results within a biological context
  38. Networks (figure panels: non-diabetics; type 2 diabetics)
     - Highlight changes in patterns of relationships
  39. Networks (figure panels: non-diabetics; type 2 diabetics)
     - Display complex interactions
  40. imDEV: interactive modules for Data Exploration and Visualization (figure panels: non-diabetics; type 2 diabetics)
     - An integrated environment for systems level analysis of multivariate data
     - http://sourceforge.net/apps/mediawiki/imdev
  41. Acknowledgements
     - Newman Lab
     - Designated Emphasis in Biotechnology (DEB)
     - NIH
     - This project is funded in part by the NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.