
# Multivariate data analysis and visualization tools for biological data

## on Dec 08, 2011


## Presentation Transcript

• Multivariate Data Analysis and Visualization Tools for Understanding Biological Data (Dmitry Grapov)
• Introduction: Systems (Oltvai, et al. Science 25 October 2002: 763-764). Figure labels: Emergent, Reductionist, Deterministic, Systems, Complex systems, Chemical analysis, Physiology, Biochemistry, Graph theory, Modeling, Informatics
• Introduction: Inference
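• Overview (figure: http://www.thefullwiki.org/Hypercube)
• Types: Univariate (1-D), Bivariate (2-D), Multivariate (n-D)
• Properties: vector (univariate), matrix (bivariate), matrix (multivariate)
• Representations: histograms and densities (univariate); scatter plots (bivariate); dendrograms, heatmaps, biplots and networks (multivariate)
• Central idea: mean (univariate), correlation (bivariate), many (multivariate)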
• Univariate: Properties
• vector of length m
• mean
• variance
• Univariate: Representations
• Univariate: Assumptions
• Normality
• Univariate: Utility
• Hypothesis testing
• α: type I error (false positive)
• β: type II error (false negative)
• power: 1 – β
• effect size: standardized difference in means
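The quantities on this slide can be tied together with a small calculation. The following is a minimal sketch, not the presenter's method: it assumes a two-sided, two-sample comparison with equal group sizes and uses a normal approximation for power. The function names `cohens_d` and `approx_power` are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Effect size: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample test.

    Assumption: normal approximation with equal group sizes; an exact
    t-test calculation gives slightly lower values for small samples.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # threshold set by the type I error rate
    shift = abs(d) * sqrt(n_per_group / 2)     # expected test statistic under the alternative
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)
```

For a fixed effect size, calling `approx_power` with increasing `n_per_group` shows how power grows with sample size; with `d = 0` the function returns `alpha`, the false positive rate.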
• Univariate: Limitations
• Biological definition of the mean?
• Relationship between sample size and test power
• Multiple hypothesis testing
• False discovery rate
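One widely used way to control the false discovery rate across many tests is the Benjamini-Hochberg procedure. The slides do not say which method was used, so the sketch below is an assumption chosen for illustration; `benjamini_hochberg` is a hypothetical helper name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Tests whose adjusted p-value falls below the chosen FDR level (for example 0.05) are reported as discoveries.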
• Old Faithful Data
• 272 observations
• time between eruptions
• 70 ± 14 min
• duration of eruption
• 3.5 ± 1 min
Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39 , 357–365
• Bivariate: Properties
• Matrix of 2 vectors of length m
• Bivariate: Representations (X, Y)
• Bivariate: Utility (X, Y)
• bivariate distribution
• correlation
• Linear relationship: Variable 2 = m × Variable 1 + b
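The slope m and intercept b of such a line are usually estimated by ordinary least squares. A minimal sketch, assuming that estimator (the slides do not name one); `fit_line` is an illustrative name.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares estimates of slope m and intercept b in y = m*x + b."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-variation of x and y
    sxx = sum((a - mx) ** 2 for a in x)                   # variation of x
    m = sxy / sxx
    b = my - m * mx
    return m, b
```

With Variable 1 as `x` and Variable 2 as `y`, `fit_line(x, y)` returns the pair `(m, b)` used in the equation above.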
• Bivariate: Limitations (figures: http://en.wikipedia.org/wiki/Correlation)
• The correlation coefficient measures only a linear or monotonic relationship
• Sensitive to outliers
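The difference between a linear and a monotonic measure, and the effect of a single outlier, can be seen by comparing Pearson and Spearman correlation on toy data. This is a sketch under the assumption that these two coefficients are the ones meant; the data values are made up for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (monotonic relationship)."""
    return pearson(ranks(x), ranks(y))


# Illustrative data: perfectly monotonic, but the last point is an outlier.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 3, 4, 5, 100]
print(pearson(x, y), spearman(x, y))
```

The outlier pulls the Pearson coefficient well below 1, while the rank-based Spearman coefficient is unaffected because the ordering of the points is unchanged.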
• Old Faithful (Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365)
• Old Unfaithful?
• Nearby hydrofracking
• Challenges
• data are often wide-structured (many more variables than observations)
• integration
• noise
• Rewards
• robust inference
• signal amplification
• holistic/systems approach
• Multivariate: Properties
• A matrix of n vectors of length m
• Correlation matrix
• Principal Components Analysis (PCA)
• Linear n-dimensional encoding of original data
• Where dimensions are:
• orthogonal (uncorrelated)
• Top k dimensions are ordered by variance explained
• Multivariate: Dimensional Reduction (figure axes: PC 1, PC 2)
• Calculating PCs: singular value decomposition (SVD)
• The original data are decomposed into scores (m x PC), explained variance (PC x PC) and loadings (n x PC)
• Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. "Singular value decomposition and principal component analysis", in A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001.
• Eigenvalue
• explained variance
• Scores
• sample representation based on all variables
• Loadings
• variable contribution to scores
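The relationship between eigenvalues, scores and loadings can be sketched in a few lines. The slides compute PCs with SVD; to stay dependency-free, the sketch below swaps in an eigen-decomposition of the covariance matrix by power iteration with deflation, which gives the same components for centered data. All function names are illustrative, and rows are samples (m) while columns are variables (n).

```python
from math import sqrt


def center(rows):
    """Subtract each column mean so every variable has mean 0."""
    n = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n)]
    return [[r[j] - means[j] for j in range(n)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (n x n) of already-centered data."""
    m, n = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]


def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and eigenvector of a symmetric matrix by power iteration."""
    n = len(matrix)
    # Asymmetric start vector; assumed not orthogonal to the leading eigenvector.
    v = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    norm_v = sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue, v


def pca(rows, k):
    """Return (scores, loadings, explained) for the top k principal components."""
    x = center(rows)
    cov = covariance(x)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    eigenvalues, loadings = [], []
    for _ in range(k):
        lam, v = top_eigenpair(cov)
        eigenvalues.append(lam)
        loadings.append(v)  # one loading vector (length n) per component
        # Deflation: remove the component just found before looking for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    # Scores: each sample projected onto each loading vector (m x k).
    scores = [[sum(r[j] * v[j] for j in range(len(r))) for v in loadings] for r in x]
    explained = [lam / total_variance for lam in eigenvalues]
    return scores, loadings, explained
```

The `explained` list holds the fraction of total variance carried by each component, which corresponds to the eigenvalue entry above; `scores` places each sample in the new coordinates and `loadings` shows how much each variable contributes.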
• Old Faithful 2.0
• 272 measurements
• 8 variables
• 2 real, 6 random noise
• Multivariate: Representations
• A matrix of n vectors of length m
• Multivariate: Representation
• Identify outliers using all measurements
• Use known values to impute missing ones
• Identify interesting groups
• Evaluate uni- and bivariate observations
• The number of PCs can be used to estimate true data complexity
• PCA: Considerations
• data pre-treatment
• outliers
• noise
• unsupervised projection
• Figure panels: no pre-treatment; centered and scaled to unit variance
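Centering and scaling to unit variance (often called autoscaling) puts variables measured in different units on an equal footing before PCA. A minimal sketch; `autoscale` is an illustrative name, and constant columns (zero standard deviation) are assumed not to occur.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Center each column to mean 0 and scale it to unit variance."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]  # sample standard deviation per variable
    return [[(value - mu) / sd for value, mu, sd in zip(row, means, sds)]
            for row in rows]
```

Without this step, a variable with a large numeric range (such as minutes between eruptions) dominates the first component regardless of how informative it is.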
• PCA: Considerations
• data pre-treatment
• outliers
• linear reconstruction
• noise
• Independent components analysis (ICA)
• unsupervised projection
• Use ICA to calculate statistically independent components
• PCA: Considerations
• data pre-treatment
• outliers
• linear reconstruction
• noise
• supervised projection
• Non-negative matrix factorization (NMF)
• NMF uses an additive, parts-based encoding (D.D. Lee and H.S. Seung, "Learning the parts of objects by nonnegative matrix factorization"; Zhipeng Zhao, ppt.)
• PCA: Considerations
• data pre-treatment
• outliers
• linear reconstruction
• noise
• supervised projection
• Identify projection correlated with class assignment (classification) or continuous variables (regression)
• Partial Least Squares Projection to Latent Structures (PLS/-DA)
• PLS/-DA: Utility
• Strengths
• Predict multiple dependent variables
• avoids issues of multicollinearity
• Independent measure of variable importance
• Weaknesses
• Need to derive an empirical reference for model performance
• Poorly established model optimization methods
• PLS-DA: Example
• Data: Old Faithful 2.0
• 272 observations on 8 variables
• Latent Variables are analogous to PCs
• Important Statistics (CV)
• Q2 = cross-validated fit
• RMSEP = root mean squared error of prediction
• AU(RO)C = area under the receiver operating characteristic curve (sensitivity vs. specificity)
• Select the appropriate number of latent variables (LVs) to maximize Q2
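The three statistics can be computed directly from observed values and cross-validated predictions. The sketch below uses common textbook definitions (Q2 as 1 minus PRESS over the total sum of squares, AUROC as the probability that a positive sample outscores a negative one); whether the presenter's software used exactly these forms is an assumption.

```python
from math import sqrt


def rmsep(y_true, y_pred):
    """Root mean squared error of prediction."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def q2(y_true, y_cv_pred):
    """Q2 = 1 - PRESS / total sum of squares, using cross-validated predictions."""
    y_bar = sum(y_true) / len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_cv_pred))
    total = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - press / total


def auroc(labels, scores):
    """Area under the ROC curve for labels coded 0/1; ties count as half a win."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

To choose the number of LVs, one would compute `q2` from cross-validated predictions for models with 1, 2, 3, ... LVs and keep the count that gives the highest value.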
• PLS-DA: Performance
• Use permutation tests to empirically determine model performance
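A permutation test builds the empirical reference by shuffling the class labels many times and recording how often the shuffled data perform as well as the real data. The sketch below is generic: `statistic` is a hypothetical callable that maps a label vector to a performance value (in practice, something like the cross-validated Q2 of a PLS-DA model refit on those labels). A simple gap between group means stands in for the model here.

```python
import random


def permutation_p_value(labels, statistic, n_permutations=1000, seed=0):
    """Fraction of label shufflings whose statistic matches or beats the observed one."""
    rng = random.Random(seed)
    observed = statistic(labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(shuffled) >= observed:
            at_least_as_good += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (at_least_as_good + 1) / (n_permutations + 1)


# Illustrative stand-in for model performance: absolute gap between group means.
values = [1.0, 1.25, 0.75, 1.5, 3.0, 3.25, 2.75, 3.5]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def mean_gap(current_labels):
    group0 = [v for v, g in zip(values, current_labels) if g == 0]
    group1 = [v for v, g in zip(values, current_labels) if g == 1]
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))


print(permutation_p_value(labels, mean_gap))
```

A small p-value indicates that the observed performance is unlikely to arise when the labels carry no information.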
• PLS: Predictive Performance
• Split data into training (2/3) and test sets (1/3)
• Generate model using training set and then predict class assignment for test set
• Use permutation tests to generate confidence bounds for future predictions
• PLS: Feature Selection
• Use PLS-DA as an objective function to identify the most informative variables
• Networks
• Network: representation of relationships among objects
• Utility
• Project statistical results into a biological context
• Explore informative data aspects in the context of all that was observed.
• Identify emergent patterns
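One simple way to turn statistical results into a network is to connect variables whose correlation exceeds a threshold. The slides do not state how their networks were built, so this construction, the threshold value and the function names are all assumptions for illustration.

```python
def correlation_network(names, corr, threshold=0.7):
    """Edge list (a, b, r) for every pair with |correlation| at or above the threshold."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i][j]) >= threshold:
                edges.append((names[i], names[j], corr[i][j]))
    return edges


def degree(edges):
    """Number of connections per node; highly connected nodes may be of interest."""
    counts = {}
    for a, b, _ in edges:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts
```

Building one edge list per group (for example, per study cohort) and comparing them is one way to look for changes in patterns of relationships.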
• Networks
• Interpret statistical results within a biological context
• Networks
• Highlight changes in patterns of relationships.
• Figure panels: non-diabetics; type 2 diabetics
• Networks
• Display complex interactions
• Figure panels: non-diabetics; type 2 diabetics
• imDEV: interactive modules for Data Exploration and Visualization. An integrated environment for systems level analysis of multivariate data. http://sourceforge.net/apps/mediawiki/imdev
• Acknowledgements: Newman Lab; Designated Emphasis in Biotechnology (DEB); NIH. This project is funded in part by the NIH grant NIGMS-NIH T32-GM008799, USDA-ARS 5306-51530-019-00D, and NIH-NIDDK R01DK078328-01.