This is an overview of advanced multivariate statistical methods, which have become highly relevant in many domains over the last few decades. These methods are powerful and can exploit today's massive datasets in meaningful ways. Typical analytics platforms favor straightforward metrics and machine learning over these statistical methods, so they are often overlooked. Additional references are available as documented.
Overview of Multivariate Statistical Methods
1. Overview of Multivariate Statistical Methods
Thomas Uttaro, Ph.D., M.S.
Deputy Director and CIO, South Beach Psychiatric Center
11th Annual NYS-OMH Institute on Mental Health Management Information
2. Introduction
Concerned with data collected on several dimensions of the same individuals or units of analysis, such as geographic regions.
Common in the social, behavioral, life, and medical sciences: medical and mental health outcomes, economic indicators, demography.
Extends univariate statistics, the analysis of variation in a single random variable: t tests, correlation, regression, ANOVA, ANCOVA, survival analysis.
Multivariate techniques account for the correlation among measures that arises from their common source, each individual or other unit of analysis. These techniques also control the type I error rate with overall (experimentwise) significance tests.
3. Preview of Multivariate Methods
Multivariate General Linear Model: the extension of ANOVA, ANCOVA, and regression to a family of methods for multivariate outcomes.
Principal Components Analysis: accounts for variation in multivariate observations with a smaller number of derived indices that are linear combinations of the original variables.
Factor Analysis: accounts for variation in multiple outcomes with a linear combination of unobserved factors plus a variable-specific term.
4. Preview of Multivariate Methods (cont.)
Discriminant Analysis: concerned with separating observations into known groups based on multivariate observations.
Cluster Analysis: concerned with identifying unknown but interpretable groups and placing individual observations within them.
Canonical Correlation: the variables are divided into two sets; concerned with describing the relationship between the two sets through correlations of linear combinations.
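As a concrete illustration of the canonical correlation idea, the minimal sketch below (in Python with NumPy, on made-up data; the deck's own examples use SPSS, SAS, and S-Plus) computes the sample canonical correlations as the singular values of S11^(-1/2) S12 S22^(-1/2), where the Sij are blocks of the joint sample covariance matrix:

```python
import numpy as np

def canonical_correlations(X, Y):
    """Sample canonical correlations between two sets of variables.

    The canonical correlations are the singular values of
    S11^{-1/2} S12 S22^{-1/2}, where the Sij are blocks of the
    joint sample covariance matrix.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    S11 = Xc.T @ Xc / (n - 1)
    S22 = Yc.T @ Yc / (n - 1)
    S12 = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Symmetric inverse square root via eigendecomposition.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    K = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(K, compute_uv=False)

# Illustrative data: the two sets share one underlying variable,
# so the first canonical correlation should be near 1.
rng = np.random.default_rng(0)
u = rng.normal(size=100)
X = np.column_stack([u + 0.01 * rng.normal(size=100), rng.normal(size=100)])
Y = np.column_stack([u + 0.01 * rng.normal(size=100), rng.normal(size=100)])
r = canonical_correlations(X, Y)
print(r)  # first canonical correlation close to 1
```

The data and function name here are hypothetical; the point is only that canonical correlation reduces to a singular value problem on the whitened cross-covariance.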
5. General Linear Model: ANOVA, ANCOVA
and Multiple Regression
The General Linear Model is a unified framework relating the ANOVA, ANCOVA, and regression methods.
ANOVA relates factor (categorical) predictors to a single continuous outcome variable.
Multiple regression relates continuous predictors to a single continuous outcome variable.
ANCOVA relates both factor and continuous predictors to a single continuous outcome variable.
A regression approach with dummy variables can be used to perform ANOVA, hence the unified GLM.
Variations on these models exist to predict binary or categorical outcomes (logistic regression, multinomial regression).
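The dummy-variable point can be demonstrated directly. In the minimal Python sketch below (made-up data; the deck's own examples use SPSS), regressing the outcome on group indicator variables recovers the group means, and the regression F statistic equals the one-way ANOVA F statistic:

```python
import numpy as np
from scipy import stats

# Illustrative data: one factor with three groups (values are made up).
groups = [np.array([3.1, 2.9, 3.4, 3.0]),
          np.array([4.2, 4.0, 4.5, 4.1]),
          np.array([5.0, 5.3, 4.8, 5.1])]
y = np.concatenate(groups)
labels = np.repeat([0, 1, 2], 4)

# Dummy coding: intercept plus indicators for groups 1 and 2
# (group 0 is the reference category).
X = np.column_stack([np.ones_like(y),
                     (labels == 1).astype(float),
                     (labels == 2).astype(float)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fitted coefficients recover the group means.
assert np.isclose(beta[0], groups[0].mean())
assert np.isclose(beta[0] + beta[1], groups[1].mean())

# The regression F test equals the one-way ANOVA F test.
fitted = X @ beta
ss_model = np.sum((fitted - y.mean()) ** 2)
ss_resid = np.sum((y - fitted) ** 2)
F = (ss_model / 2) / (ss_resid / (len(y) - 3))
F_anova, p = stats.f_oneway(*groups)
print(F, F_anova)  # the two F statistics agree
```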
6. ANOVA and Multiple Regression
Examples using SPSS 10.5
The ANOVA example relates FACA (2 levels, perhaps gender), FACB (3 levels, perhaps region), and the FACA x FACB interaction to a single dependent variable (perhaps annual income). Program: steve296anova.SPS
F tests indicate that the main effects FACA and FACB are significant at the p<.001 level; the FACA x FACB interaction is non-significant.
The multiple regression example predicts instructor evaluation from 5 predictors: clarity, stimulation, knowledge, interest, and course evaluation. All variables are continuous. The analysis accounts for correlation among the predictors and determines which are most important. Program: stevep84MBAreg.SPS
t tests of the regression coefficients indicate that all variables except interest are significant in predicting the level of instructor evaluation.
Several diagnostics are output, including residuals, leverages, and Cook's distance (influential data points). The plot of standardized regression residuals should be approximately normal.
7. Multivariate General Linear Model
Extensions of single dependent variable procedures
such as ANOVA, ANCOVA, and multiple regression.
Statistical framework includes MANOVA (factor
predictors), MANCOVA (factors and continuous
predictors), and multivariate multiple regression
(continuous predictors).
Prevents an inflated overall type I error rate, accounts for correlations among the dependent variables, and can detect the joint significance of a set of variables even when no univariate analysis is significant.
Hotelling's T² is the overall multivariate test statistic for comparing two groups and a generalization of the univariate t. It tests the H0 that the two population mean vectors are equal.
8. Multivariate General Linear Model (cont.)
Multivariate significance implies that there is a linear combination of the dependent variables (the discriminant function) that separates the k groups.
Multivariate test statistics are functions of eigenvalues, which are fundamental to all multivariate analyses.
Four multivariate test statistics are commonly used: Wilks' Λ, Roy's largest root, the Hotelling-Lawley trace, and the Pillai-Bartlett trace. Wilks' Λ is the most common.
Following a significant overall finding, post hoc or planned comparisons are used to determine which variables drive the separation between groups.
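For the two-group case, the following minimal sketch (Python with NumPy/SciPy, on made-up data; the deck's examples use SPSS and SAS) computes Hotelling's T² from the pooled within-group covariance matrix and converts it to an F statistic for the overall test:

```python
import numpy as np
from scipy import stats

def hotelling_t2(X1, X2):
    """Two-sample Hotelling's T^2 test of equal mean vectors.

    Returns (T2, F, p): under H0, T^2 converts to an F statistic
    with (p, n1 + n2 - p - 1) degrees of freedom.
    """
    n1, p = X1.shape
    n2 = X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    # Pooled within-group covariance matrix.
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    T2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
    F = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2
    pval = stats.f.sf(F, p, n1 + n2 - p - 1)
    return T2, F, pval

# Illustrative data: two groups whose mean vectors differ on both variables.
rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0.0, 0.0], size=(30, 2))
X2 = rng.normal(loc=[1.5, 1.0], size=(30, 2))
T2, F, pval = hotelling_t2(X1, X2)
print(round(F, 2), pval)
```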
9. Multivariate Regression Example using
SPSS 10.5
Timm data on differences in cognitive tests due to learning tasks. Scores on Raven's Progressive Matrices and the Peabody Picture Vocabulary Test are regressed on 3 learning tasks. Program: steve132multivreg.SPS
The multivariate test statistic Wilks' Λ is significant, indicating a significant relationship between the dependent variables and the 3 predictors beyond the .01 level.
Univariate F tests examine the regression on each variable separately. In particular, NA (named action) is related to PEVOCAB at t=2.68, p<.011.
The univariate prediction equations do not take into account the correlations among the dependent variables.
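That last point can be made concrete. In the short sketch below (Python, made-up data; the deck's example uses SPSS), the multivariate least squares coefficients B = (X'X)⁻¹X'Y coincide exactly with separate univariate fits for each outcome; what the multivariate analysis adds is the residual covariance among the outcomes, which the joint tests (e.g. Wilks' Λ) use:

```python
import numpy as np

# Illustrative data: two outcomes regressed on three predictors.
rng = np.random.default_rng(2)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 predictors
B_true = np.array([[1.0, -1.0],
                   [0.5, 0.0],
                   [0.0, 2.0],
                   [1.5, 0.3]])
Y = X @ B_true + 0.1 * rng.normal(size=(n, 2))

# Multivariate least squares: B = (X'X)^{-1} X'Y, fitted column by column.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Identical to separate univariate regressions on each outcome.
b0, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)
b1, *_ = np.linalg.lstsq(X, Y[:, 1], rcond=None)

# The residual covariance between the outcomes is what the joint
# multivariate tests use but the univariate tests ignore.
E = Y - X @ B_hat                       # residual matrix
resid_cov = E.T @ E / (n - X.shape[1])  # residual covariance of the outcomes
print(np.allclose(B_hat[:, 0], b0) and np.allclose(B_hat[:, 1], b1))
```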
10. MANOVA Example with Tukey
post-hoc tests using SAS V8
Novince data on improving social skills among college women. 3 groups: control, behavioral rehearsal, and cognitive restructuring; 4 variables: anxiety, social interaction skills, appropriateness, and assertiveness.
A SAS program runs the 3-treatment-group MANOVA on the 4 measures to determine treatment effectiveness. Program: SASstevep204.sas
Significant overall multivariate tests indicate true differences between groups on one or more variables and their linear combinations. Excellent optional output is available.
Tukey post hoc tests generate significance levels and confidence intervals to examine the effects of the individual variables.
11. Crisis Residence Treatment and the
Basis-32 at South Beach PC
Treatment, gender, and GAF covariate effects on BASIS-32 subscale scores; n=73 paired admissions and discharges.
Highly significant pre/post treatment effect, F=4.216, 5 df, p<.001: crisis residence (CR) treatment is effective in terms of all BASIS-32 subscales.
Significant GAF covariate effect, F=5.271, 5 df, p<.001: a strong relationship between the clinician-rated GAF and the self-report BASIS-32 on the relation-to-self/others and depression subscales.
The gender by treatment interaction is non-significant: CR is equally effective for both genders in terms of subscale scores.
Statistical diagnostics indicate excellent power for all tests.
12. Principal Components Analysis
An analysis based on a large number of original variables can be simplified to a smaller number of standardized linear combinations of those variables.
The transformation y = Γ'(x − μ), where Γ is orthogonal and Γ'ΣΓ = Λ (the diagonal matrix of eigenvalues), defines the principal components. The ith principal component of x is the ith element of y, yi = γi'(x − μ), where γi is the ith column of Γ; the components are uncorrelated. Principal components analysis essentially involves finding the eigenvalues and eigenvectors of the covariance matrix Σ.
The first principal component has the largest variance of all standardized linear combinations of x.
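These properties are easy to verify numerically. The minimal sketch below (Python with NumPy, on made-up data; the deck's example uses S-Plus) performs PCA by eigendecomposition of the sample covariance matrix and checks that the component scores are uncorrelated with variances equal to the eigenvalues:

```python
import numpy as np

# Illustrative data: 25 observations on 5 correlated variables.
rng = np.random.default_rng(3)
latent = rng.normal(size=(25, 1))
X = latent + 0.5 * rng.normal(size=(25, 5))

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# Eigendecomposition of the covariance matrix: eigenvalues are the
# component variances, eigenvectors are the columns of Gamma.
vals, Gamma = np.linalg.eigh(Sigma)
order = np.argsort(vals)[::-1]          # sort by decreasing variance
vals, Gamma = vals[order], Gamma[:, order]

# y = Gamma'(x - mu): the principal component scores.
Y = (X - mu) @ Gamma

# The components are uncorrelated and their variances are the eigenvalues.
assert np.allclose(np.cov(Y, rowvar=False), np.diag(vals))
# Total variance is preserved.
assert np.isclose(vals.sum(), np.trace(Sigma))
print(vals / vals.sum())  # proportion of variance per component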
13. Principal Components Example using S-Plus V6 (PrinComp.ssc)
Ph.D. qualifying examinations in five areas of mathematics for 25 students.
The analysis is carried out using the S-Plus princomp function, which returns an object of class princomp.
A large coefficient (in absolute value) corresponds to a high loading; a coefficient near zero corresponds to a low loading.
The first principal component's loadings are of moderate size and in the same direction, representing an average score.
The second principal component contrasts the two closed-book exams with the three open-book exams, with the first and last exams weighted most heavily.
Output includes plots of the principal component loadings and the biplot of the original and transformed test scores in the two-dimensional principal component space.
14. Factor Analysis
Factor analysis explains the correlations between observed variables with a smaller number of underlying factors.
The model is x = μ + Λf + u, where Λ = {λij} is the matrix of factor loadings and f and u are the common and unique factors respectively. Equivalently Σ = ΛΛ' + Ψ, a decomposition into factor and error covariances.
The diagonal of ΛΛ' gives the communalities h²i, the variation in each variable shared through the common factors, and the diagonal elements Ψii are the uniquenesses, the variation in xi not shared with the other variables. For standardized variables these sum to 1 for each variable.
The factor solution is not unique: for any orthogonal G, Σ = (ΛG)(ΛG)' + Ψ, so the factors can be rotated to ease interpretation. Δ = ΛG is the matrix of rotated factor loadings. The analyst seeks simple structure in the rotation: each variable should load highly on one factor, and each factor loading should be either large in absolute value or near zero.
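The decomposition and the rotation invariance can be checked numerically. In the sketch below (Python with NumPy; the loading values are made up for illustration), the communalities and uniquenesses sum to 1 for each standardized variable, and an orthogonal rotation of the loadings leaves the implied covariance matrix unchanged:

```python
import numpy as np

# Hypothetical loadings for 4 standardized variables on 2 factors
# (the numbers are made up for illustration).
Lambda = np.array([[0.8, 0.3],
                   [0.7, 0.4],
                   [0.2, 0.9],
                   [0.3, 0.8]])
communalities = np.sum(Lambda ** 2, axis=1)    # h_i^2 = row sums of squares
Psi = np.diag(1.0 - communalities)             # uniquenesses: h_i^2 + psi_ii = 1
Sigma = Lambda @ Lambda.T + Psi                # implied correlation matrix

# Rotation: for any orthogonal G, (Lambda G)(Lambda G)' + Psi = Sigma,
# so the rotated loadings Delta = Lambda G fit equally well.
theta = 0.3
G = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Delta = Lambda @ G
assert np.allclose(Delta @ Delta.T + Psi, Sigma)
print(np.diag(Sigma))  # all ones: communality + uniqueness = 1 per variable
```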
15. Factor Analysis Example using S-Plus V6 (FactorAnal.ssc)
S-Plus provides the factanal function to perform factor analysis.
Using the test scores, we analyze whether a two-factor model (overall ability and closed versus open book) explains the overall variation in the scores.
The two-factor model explains about 80% of the variation in the original data, with the first factor accounting for 45%.
The rotated factor loadings indicate the importance of the first, overall-ability factor and the relative effects of the closed- and open-book exams.
Output includes plots of the factor loadings and the biplot of the test scores in the two-dimensional factor space.
16. Discriminant Function Analysis
Concerned with allocating observations to one of several a priori defined classes.
The classifier is calibrated on a training sample in which group membership is known and then applied to test cases whose membership is unknown.
In medicine, post-mortem information (classes based on survival) can be used to classify at-risk patients for mortality or morbidity.
An observation is classified into one of two groups on a series of measurements x1, x2, x3, ..., xp using a linear function z of the variables: z = a1x1 + a2x2 + ... + apxp.
17. Discriminant Function Analysis (cont.)
The coefficients maximize the ratio of the between-groups variance of z to the within-groups variance: V = a'Ba / a'Sa.
The method assumes that the data in both groups have a multivariate normal distribution and that the two groups share a common covariance matrix.
The function evaluates z for each observation relative to a cutoff zc: assign to group 1 if zi − zc < 0, and to group 2 if zi − zc ≥ 0.
Performance can be assessed through the misclassification rate on known cases in the training set.
Significance tests are available, including 1) Wilks' Λ or the other multivariate statistics previously mentioned, as well as Hotelling's T²; 2) tests of whether the discriminant function differs between the groups; and 3) a chi-square test of the Mahalanobis distances from observations to their group centers: if the chi-square is large, it is unlikely that the observation came from that group.
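A minimal two-group sketch (Python with NumPy, made-up data; the deck's example uses SAS proc discrim) computes Fisher's linear discriminant coefficients as a = S⁻¹(mean1 − mean2), classifies by comparing z to a midpoint cutoff, and reports the resubstitution misclassification rate. The sign convention here (group 1 above the cutoff) is one of the two equivalent choices:

```python
import numpy as np

def fisher_discriminant(X1, X2):
    """Two-group linear discriminant: a = S^{-1}(mean1 - mean2),
    where S is the pooled within-groups covariance matrix."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S, m1 - m2)
    zc = 0.5 * (a @ m1 + a @ m2)   # cutoff midway between the group means of z
    return a, zc

# Illustrative training data with separated group means.
rng = np.random.default_rng(4)
X1 = rng.normal(loc=[2.0, 2.0], size=(50, 2))
X2 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
a, zc = fisher_discriminant(X1, X2)

# Resubstitution: classify the training cases and count errors.
z = np.concatenate([X1, X2]) @ a
pred = np.where(z >= zc, 1, 2)    # z above the cutoff -> group 1 here
truth = np.repeat([1, 2], 50)
misclassification_rate = np.mean(pred != truth)
print(misclassification_rate)
```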
18. Discriminant Function Analysis Example using SAS V8 (SASHandDA.sas)
An archaeological study of two types of skulls from the Tibetan areas of Sikkim or Kharis (fundamental human type); 5 measurements were taken on each of 32 skulls.
This can be considered a training set for future classification; the analysis will also identify the most important variables for discrimination.
The proc discrim output includes the within-group and between-group covariance matrices, covariance diagnostics, generalized pairwise distances between groups, discriminant function coefficients, and the misclassification (resubstitution) rate.
Proc stepdisc finds that faceheight is the most important variable for classifying members into groups. This is cross-validated with another proc discrim run using only faceheight.
19. Cluster Analysis
Concerned with allocating observations to discrete groups, or clusters, that are not known in advance.
A hierarchy of solutions, from single-observation clusters to a single cluster containing all observations, is displayed in a dendrogram. A particular clustering partition is considered optimal based on statistical and practical criteria.
Clustering methods operate on the inter-individual Euclidean distance matrix calculated from the raw data.
Single linkage (nearest neighbor): two groups are merged at a given distance if the closest individuals from each group are no more than that distance apart.
Complete linkage (furthest neighbor): two groups merge only if their most distant members are close enough together.
Average linkage: two groups merge if the average distance between their members is close enough.
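The three linkage rules above can be compared directly with SciPy's hierarchical clustering routines (a Python sketch on made-up data; the deck's example uses SAS proc cluster). The linkage function builds the merge history from the Euclidean distance matrix, and fcluster cuts the hierarchy into a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Illustrative data: two well-separated groups of points in the plane.
rng = np.random.default_rng(5)
pts = np.vstack([rng.normal(loc=[0, 0], scale=0.3, size=(10, 2)),
                 rng.normal(loc=[5, 5], scale=0.3, size=(10, 2))])

# Euclidean distance matrix (condensed form) from the raw data.
D = pdist(pts, metric="euclidean")

# Compare the three linkage rules described above.
for method in ("single", "complete", "average"):
    Z = linkage(D, method=method)                    # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut to 2 clusters
    print(method, labels)

# With well-separated groups, all three rules recover the same partition.
labels = fcluster(linkage(D, method="single"), t=2, criterion="maxclust")
assert len(set(labels[:10])) == 1 and len(set(labels[10:])) == 1
```

With less cleanly separated data the three methods often disagree, which is why the example below runs several methods and compares them.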
20. Cluster Analysis Example using SAS V8 (SASHandCA.sas)
Analysis of air quality in U.S. cities. The objective is to identify groups of similar cities for policy intervention.
Clustering variables include SO2, temperature, factories, population, windspeed, rain, and rainydays.
The first step is to look for outliers using proc univariate. Chicago is an outlier on manufacturing and population, and Phoenix has the lowest value on all three climate variables; these cities are excluded from the analysis.
The results from several runs, each based on a different clustering method, are complex and require interpretation and a feel for the technique.
21. Cluster Analysis Example
using SAS V8 (cont.)
The cluster history indicates the stages at which cities and clusters are joined, the distances at which they join, and other diagnostics.
A bimodality index of at least .55 suggests clustering on a particular variable; factories and population reach .55.
The value of the cubic clustering criterion (CCC) is a guide to the number of clusters in the data; it peaks at 4 clusters for the single and complete linkage runs. The number of large eigenvalues of the correlation matrix may also suggest the dimensionality of the data. Four clusters is only an approximation, as the evidence is not that clear.
Dendrograms may also suggest evidence of structure, but generally they do not make the optimal number of groups obvious.
22. Cluster Analysis Example
using SAS V8 (cont.)
The means of the clustering variables can be examined to understand how the clusters differ, and mean differences on these variables can be tested.
Clustering solutions can be displayed by plotting the data in principal component space, since the principal components are linear transformations of the clustering variables.
In this example the first two principal components are derived and the observations are graphed by cluster. The clusters are distinct in the locations of their observations, although the solution is not optimal.
A box plot was created and means were tested for differences on the SO2 level.