Multivariate data analysis and visualization tools for biological data
 

Presentation Transcript

  • Multivariate Data Analysis and Visualization Tools for Understanding Biological Data Dmitry Grapov
  • Introduction: Systems
    • Figure labels: Emergent; Reductionist; Deterministic systems; Complex systems; Chemical analysis; Physiology; Biochemistry; Graph theory; Modeling; Informatics
    Oltvai, et al. Science 25 October 2002: 763-764.
  • Introduction: Inference
  • Overview (source: http://www.thefullwiki.org/Hypercube)
    • Univariate (1-D)
      • Properties: vector
      • Representations: histograms, densities
      • Central idea: mean
    • Bivariate (2-D)
      • Properties: matrix
      • Representations: scatter plots
      • Central idea: correlation
    • Multivariate (n-D)
      • Properties: matrix
      • Representations: dendrograms, heatmaps, biplots, networks
      • Central idea: many
  • Univariate: Properties
    • vector of length m
      • mean
      • variance
  • Univariate: Representations
  • Univariate: Assumptions
    • Normality
  • Univariate: Utility
    • Hypothesis testing
      • α: type I error (false positive)
      • β: type II error (false negative)
      • power: 1 – β
      • effect size: standardized difference in means
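The quantities on this slide can be tied together in a few lines. The sketch below is only an illustration: it assumes Cohen's d (pooled standard deviation) as the effect size and a two-sided, two-sample z-test approximation for power, neither of which is specified in the slides. The function names and the `control`/`treated` values are made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def cohens_d(group_a, group_b):
    """Standardized difference in means, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2 +
                  (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_b) - mean(group_a)) / sqrt(pooled_var)

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample z-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(effect_size) * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Made-up measurements for two groups (illustration only).
control = [4.8, 5.1, 5.0, 4.7, 5.3, 4.9]
treated = [5.6, 5.9, 5.4, 6.1, 5.7, 5.8]
d = cohens_d(control, treated)
print("effect size d =", round(d, 2))

# Power grows with sample size for a fixed effect size.
for n in (5, 10, 20, 40):
    print(n, "per group -> power", round(approx_power(0.8, n), 3))
```

With an effect size of zero the function returns α itself, which is a quick sanity check that the type I error rate is what was requested.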
  • Univariate: Limitations
    • Biological definition of the mean?
    • Relationship between sample size and test power
    • Multiple hypothesis testing
      • False discovery rate
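One common way to control the false discovery rate is the Benjamini-Hochberg step-up procedure. The slides do not name a specific method, so treat the choice of procedure, the function name, and the p-values below as assumptions made for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected

# Made-up p-values from ten univariate tests (illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, fdr=0.05)
naive = sum(p < 0.05 for p in p_values)
print(naive, "tests pass p < 0.05;", sum(rejected), "survive FDR control")
```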
  • Old Faithful Data
    • 272 observations
    • time between eruptions
      • 70 ± 14 min
    • duration of eruption
      • 3.5 ± 1 min
    Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365.
  • Bivariate: Properties
    • Matrix of 2 vectors of length m
  • Bivariate: Representations (X, Y)
  • Bivariate: Utility (X, Y)
    • bivariate distribution
    • correlation
    • linear fit: Variable 2 = m × Variable 1 + b
  • Bivariate: Limitations of the correlation coefficient (source: http://en.wikipedia.org/wiki/Correlation)
    • Measure of linear or monotonic relationship
  • Bivariate: Limitations (source: http://en.wikipedia.org/wiki/Correlation)
    • Sensitive to outliers
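Both limitations are easy to reproduce on toy data. The sketch below hand-rolls Pearson (linear) and Spearman (rank-based, monotonic) correlation so it needs no third-party packages. The data are invented for illustration, and the ranking helper assumes there are no tied values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank values from 1..n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y_curved = [v ** 3 for v in x]        # monotonic but not linear
y_linear = [2 * v + 1 for v in x]     # perfectly linear
y_outlier = y_linear[:-1] + [-50]     # same line with one extreme point

print("curved:  pearson", round(pearson(x, y_curved), 3),
      "spearman", round(spearman(x, y_curved), 3))
print("linear:  pearson", round(pearson(x, y_linear), 3))
print("outlier: pearson", round(pearson(x, y_outlier), 3))
```

In this toy case a single extreme point is enough to turn a perfect positive linear correlation into a negative one.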
  • Old Faithful
    Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365.
  • Old Unfaithful?
    • Additional variables
    • Nearby hydrofracking
    • Improve inference based on more information
    • Challenges
      • data are often wide-structured (many variables, few observations)
      • integration
      • noise
    • Rewards
      • robust inference
      • signal amplification
      • holistic/systems approach
  • Multivariate: Properties
    • A matrix of n vectors of length m
    • Correlation matrix
  • Multivariate: Dimensional Reduction
    • Principal Components Analysis (PCA)
      • Linear n-dimensional encoding of original data
      • Where dimensions are:
        • orthogonal (uncorrelated)
        • ordered by variance explained (top k dimensions)
    (Figure axes: PC 1, PC 2)
  • Multivariate: Dimensional Reduction
    • Calculating PCs: singular value decomposition (SVD)
      • Original data decomposes into:
        • Scores (m x PC)
        • Explained variance (PC x PC)
        • Loadings (n x PC)
    Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis". In A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
    • Eigenvalue
      • explained variance
    • Scores
      • sample representation based on all variables
    • Loadings
      • variable contribution to scores
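To make scores, loadings, and explained variance concrete, here is a small dependency-free sketch. Note the swap: instead of SVD it uses power iteration with deflation on the covariance matrix, which recovers the same components, because the eigenvectors of the covariance matrix are the right singular vectors of the centered data. The function names and the ten-row data set are invented for illustration; in practice a library SVD routine would normally be used.

```python
from math import sqrt

def center(data):
    """Subtract each column (variable) mean; rows are observations."""
    n_rows, n_cols = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in data]

def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n_rows, n_cols = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n_rows - 1)
             for j in range(n_cols)] for i in range(n_cols)]

def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and its unit eigenvector."""
    size = len(matrix)
    vec = [float(i + 1) for i in range(size)]
    norm = sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    value = sum(a * b for a, b in zip(vec, mat_vec(matrix, vec)))
    return value, vec

def pca(data, n_components):
    centered = center(data)
    cov = covariance(centered)
    total_var = sum(cov[i][i] for i in range(len(cov)))
    eigenvalues, loadings = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        eigenvalues.append(value)
        loadings.append(vec)
        # Deflate: remove the component just found before looking for the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(cov))]
               for i in range(len(cov))]
    # Scores: each observation projected onto each loading vector.
    scores = [[sum(x * w for x, w in zip(row, vec)) for vec in loadings]
              for row in centered]
    explained = [value / total_var for value in eigenvalues]
    return scores, loadings, explained

# Made-up data: 10 observations on 3 variables, the first two strongly related.
data = [[2.5, 2.4, 0.5], [0.5, 0.7, 0.3], [2.2, 2.9, 0.4], [1.9, 2.2, 0.6],
        [3.1, 3.0, 0.2], [2.3, 2.7, 0.5], [2.0, 1.6, 0.4], [1.0, 1.1, 0.3],
        [1.5, 1.6, 0.6], [1.1, 0.9, 0.2]]
scores, loadings, explained = pca(data, n_components=2)
print("fraction of variance explained:", [round(e, 3) for e in explained])
```

Each row of `scores` represents one sample using all variables at once, and each entry of a loading vector shows how much one variable contributes to that component.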
  • Multivariate: Representations
    • A matrix of n vectors of length m
    • Old Faithful 2.0
      • 272 measurements
      • 8 variables
        • 2 real, 6 random noise
  • Multivariate: Representation
    • Identify outliers using all measurements
    • Use known values to impute missing values
    • Identify interesting groups
    • Evaluate uni- and bivariate observations
    • The number of PCs can be used to estimate true data complexity
  • PCA: Considerations
    • data pre-treatment
    • outliers
    • noise
    • unsupervised projection
    (Figure panels: no pre-treatment; centered and scaled to unit variance)
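Centering and scaling to unit variance (often called autoscaling) matters because PCA follows variance: without it, the variable measured on the largest scale dominates the first component. A minimal sketch, with a made-up function name and made-up values on two very different scales:

```python
from statistics import mean, stdev

def autoscale(data):
    """Center each column to mean 0 and scale it to unit variance."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [[(value - m) / s for value, m, s in zip(row, means, sds)]
            for row in data]

# Made-up observations: the second variable spans a much larger range.
raw = [[3.6, 79], [1.8, 54], [3.3, 74], [2.3, 62], [4.5, 85], [2.9, 55]]
scaled = autoscale(raw)

for name, table in (("raw", raw), ("scaled", scaled)):
    sds = [round(stdev(col), 3) for col in zip(*table)]
    print(name, "column standard deviations:", sds)
```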
  • PCA: Considerations
    • data pre-treatment
    • outliers
    • linear reconstruction
    • noise
      • Independent components analysis (ICA)
    • unsupervised projection
    Use ICA to calculate statistically independent components
  • PCA: Considerations
    • data pre-treatment
    • outliers
    • linear reconstruction
    • noise
    • supervised projection
      • Non-negative matrix factorization (NMF)
    NMF uses an additive, parts-based encoding.
    Source: Learning the parts of objects by nonnegative matrix factorization, D.D. Lee, H.S. Seung, Zhipeng Zhao, ppt.
  • PCA: Considerations
    • data pre-treatment
    • outliers
    • linear reconstruction
    • noise
    • supervised projection
      • Identify projection correlated with class assignment (classification) or continuous variables (regression)
      • Partial Least Squares Projection to Latent Structures (PLS/-DA)
  • PLS/-DA: Utility
    • Strengths
      • Predicts multiple dependent variables
      • Avoids issues of multicollinearity
      • Independent measure of variable importance
    • Weaknesses
      • Need to derive an empirical reference for model performance
      • Poorly established model optimization methods
  • PLS-DA: Example
    • Data: Old Faithful 2.0
      • 272 observations on 8 variables
    • Latent Variables are analogous to PCs
    • Important Statistics (CV)
      • Q2 = fit
      • RMSEP = error of prediction
      • AU(RO)C = specificity vs. sensitivity
    Select the number of latent variables (LVs) that maximizes Q2
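As a rough illustration of these statistics, the sketch below assumes the usual cross-validated definitions, Q2 = 1 - PRESS/TSS and RMSEP = sqrt(PRESS/n), where PRESS is the sum of squared differences between observed values and cross-validated predictions. The slides do not spell these formulas out, and the predictions below are invented rather than produced by a real PLS-DA model.

```python
from math import sqrt

def press(observed, cv_predicted):
    """Predicted residual sum of squares from cross-validated predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, cv_predicted))

def q2(observed, cv_predicted):
    """Cross-validated fit: 1 - PRESS / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    tss = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - press(observed, cv_predicted) / tss

def rmsep(observed, cv_predicted):
    """Root mean squared error of prediction."""
    return sqrt(press(observed, cv_predicted) / len(observed))

def pick_n_lvs(observed, cv_predictions_by_lv):
    """Number of LVs whose cross-validated predictions maximize Q2."""
    return max(cv_predictions_by_lv,
               key=lambda k: q2(observed, cv_predictions_by_lv[k]))

# Class labels coded 0/1, with hypothetical cross-validated predictions
# for models built with 1, 2, and 3 latent variables.
observed = [0, 0, 0, 1, 1, 1]
cv_predictions_by_lv = {
    1: [0.4, 0.3, 0.5, 0.6, 0.5, 0.7],
    2: [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],
    3: [0.2, 0.4, 0.1, 0.7, 0.9, 0.6],
}
best = pick_n_lvs(observed, cv_predictions_by_lv)
for k, preds in cv_predictions_by_lv.items():
    print(k, "LVs: Q2 =", round(q2(observed, preds), 3),
          "RMSEP =", round(rmsep(observed, preds), 3))
print("selected number of LVs:", best)
```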
  • PLS-DA: Performance
    • Use permutation tests to empirically determine model performance
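The idea behind a permutation test can be sketched without a full PLS-DA implementation: shuffle the class labels many times, recompute the model statistic each time, and see how often chance alone matches the real result. In the sketch below the "model statistic" is a deliberately simple stand-in (absolute difference in class means), not Q2 from a PLS-DA model. The function names and data are made up for illustration.

```python
import random

def class_separation(values, labels):
    """Stand-in model statistic: absolute difference between class means."""
    a = [v for v, l in zip(values, labels) if l == 0]
    b = [v for v, l in zip(values, labels) if l == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def permutation_p_value(values, labels, statistic, n_permutations=999, seed=0):
    """Fraction of label shufflings that do at least as well as the real labels."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(values, shuffled) >= observed:
            hits += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_permutations + 1)

# Made-up measurements: six "short" and six "long" values.
values = [1.8, 2.0, 1.9, 2.2, 1.7, 2.1, 4.3, 4.5, 4.1, 4.6, 4.4, 4.0]
true_labels = [0] * 6 + [1] * 6
unrelated_labels = [0, 1] * 6

p_true = permutation_p_value(values, true_labels, class_separation)
p_unrelated = permutation_p_value(values, unrelated_labels, class_separation)
print("p-value with real labels:", p_true)
print("p-value with unrelated labels:", p_unrelated)
```

The distribution of the permuted statistic serves as the empirical reference against which the real model is judged.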
  • PLS: Predictive Performance
    • Split data into training (2/3) and test sets (1/3)
    • Generate model using training set and then predict class assignment for test set
    • Use permutation tests to generate confidence bounds for future predictions
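The split-then-predict workflow might look like the following. To stay self-contained, a nearest-centroid classifier stands in for PLS-DA; it is not the method used in the slides, and the helper names and data are hypothetical.

```python
import random

def split_train_test(rows, labels, train_fraction=2 / 3, seed=0):
    """Randomly assign observations to training and test sets."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = round(len(indices) * train_fraction)
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def fit_centroids(rows, labels):
    """Stand-in classifier (not PLS-DA): mean vector of each class."""
    centroids = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(centroids, row):
    """Assign the class whose centroid is closest to the observation."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Made-up two-variable observations in two well-separated classes.
rows = [[1.8, 54], [2.0, 51], [1.9, 55], [2.2, 58], [1.7, 50], [2.1, 53],
        [4.3, 80], [4.5, 85], [4.1, 78], [4.6, 88], [4.4, 82], [4.0, 76]]
labels = ["short"] * 6 + ["long"] * 6

train_x, train_y, test_x, test_y = split_train_test(rows, labels)
model = fit_centroids(train_x, train_y)
predictions = [predict(model, row) for row in test_x]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(len(train_x), "training rows,", len(test_x), "test rows, accuracy", accuracy)
```

The test set plays no part in fitting, so its accuracy estimates how the model would perform on future observations.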
  • PLS: Feature Selection
    • Use the PLS-DA as an objective function to identify the most informative variables
  • Networks
    • Network: representation of relationships among objects
    • Utility
      • Project statistical results into a biological context
      • Explore informative data aspects in the context of all that was observed
      • Identify emergent patterns
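One simple way to represent relationships among variables as a network is to connect any two variables whose correlation exceeds a threshold. The slides do not say how edges were defined, so the correlation rule, the 0.8 threshold, and the variable names below are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_network(variables, threshold=0.8):
    """Adjacency dict: variable name -> set of strongly correlated neighbours."""
    graph = {name: set() for name in variables}
    for a, b in combinations(variables, 2):
        if abs(pearson(variables[a], variables[b])) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

# Made-up measurements of three variables across six samples.
variables = {
    "var_a": [1, 2, 3, 4, 5, 6],
    "var_b": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],  # tracks var_a closely
    "var_c": [5, 3, 6, 2, 7, 4],               # largely unrelated
}
graph = correlation_network(variables)
for name, neighbours in graph.items():
    print(name, "->", sorted(neighbours))
```

Building one such graph per group and comparing their edges is one way to look for changes in patterns of relationships.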
  • Networks
    • Interpret statistical results within a biological context
  • Networks
    • Highlight changes in patterns of relationships.
    (Figure panels: non-diabetics; type 2 diabetics)
  • Networks
    • Display complex interactions
    (Figure panels: non-diabetics; type 2 diabetics)
  • imDEV: interactive modules for Data Exploration and Visualization
    • An integrated environment for systems-level analysis of multivariate data.
    • http://sourceforge.net/apps/mediawiki/imdev
    (Figure panels: non-diabetics; type 2 diabetics)
  • Acknowledgements
    • Newman Lab
    • Designated Emphasis in Biotechnology (DEB)
    • NIH
    This project is funded in part by the NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.