Multivariate Analysis

Analysis of multiple variables in a
single relationship or set of
relationships.


Some basic concepts of
multivariate analysis
• The variate
• Measurement scales
• Measurement error and multivariate
measurement
• Statistical significance vs. statistical
power


1. The Variate
• A variate, also called a linear
combination, is a linear combination of variables
with empirically determined weights.
Variate value = w1x1 + w2x2 + ... + wnxn
x1, x2, ..., xn = observed variables
w1, w2, ..., wn = weights
• The variables are specified by the
researcher.
• The weights are determined by the
multivariate technique.
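As a minimal sketch of the variate formula above, the value for one observation is the weighted sum of its observed variables. The function name `variate_value` and the numbers are hypothetical, chosen only for illustration; in practice the weights come out of the estimation.

```python
def variate_value(weights, values):
    """Return w1*x1 + w2*x2 + ... + wn*xn for one observation."""
    if len(weights) != len(values):
        raise ValueError("weights and values must have the same length")
    return sum(w * x for w, x in zip(weights, values))

# Hypothetical weights and observed values, for illustration only.
weights = [0.5, 2.0, -1.0]
observed = [4.0, 1.5, 3.0]
print(variate_value(weights, observed))  # 0.5*4.0 + 2.0*1.5 - 1.0*3.0 = 2.0
```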


2. Measurement Scales
• Data analysis involves the
identification and measurement of
variation in a set of variables.
• The researcher cannot identify
variation unless it can be measured.
• Measurement is important both for
representing the concept accurately and for selecting
the appropriate multivariate method of
analysis.


Measurement scales can be classified into
two categories:
• Nonmetric (qualitative)
• Metric (quantitative)

Non metric measurement scales




Nonmetric data, also called qualitative data, are measures that
describe a subject by indicating the presence or
absence of a characteristic or property.
For example, if a person is male, he cannot be
female. An “amount” of gender is not
possible, just the state of being male or female.
Nonmetric measurements can be made
with either a nominal or an ordinal scale.

Nominal Scales
• A nominal scale assigns numbers to
label or identify subjects or objects.
• Nominal scales are also known as
categorical scales.
• The numbers have no quantitative meaning
beyond indicating the category.
For example, in representing gender the
researcher might assign a number to
each category:
2 for females
1 for males


Ordinal Scales


In ordinal scales, variables can be ordered or ranked.



Every subject or object can be compared with another
in terms of a “greater than” or “less than” relationship.



They provide only the order of the values, not a measure
of the actual amount.
For example, opinion scales that go from “most important”
to “least important” or from “strongly agree” to “strongly
disagree”.

With an ordinal scale we cannot determine the
amount of the difference between
products.
That is, we cannot answer the question of
whether the difference between Products
A and B is greater than the difference
between Products B and C.

Metric Measurement Scales
• Metric data, also called quantitative
data, identify subjects
not only by the possession of an attribute
but also by the amount or degree to which the
subject may be characterized by the
attribute.
For example, age and height.
• The two metric measurement scales are:
▫ Interval scales
▫ Ratio scales


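Interval and ratio scales
• Both scales have constant units of measurement,
so differences between values are meaningful.
• Interval scales have an arbitrary zero point.
For example, temperature in Celsius or Fahrenheit.
• Ratio scales have an absolute zero point, so values
can be divided and compared as ratios.
For example, income, weight, height and age.
• With either scale we can answer the question of whether the
difference between Products A and B is
greater than the difference between
Products B and C.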


MEASUREMENT ERROR AND
MULTIVARIATE
MEASUREMENT

Measurement Error


Measurement error is the degree to which the observed
values do not represent the “true” values, due to
inappropriate response scales, data entry
errors, or respondent errors.

For example:
1. Imposing a 7-point rating scale for attribute
measurement when the researcher knows the
respondents can accurately respond only to a 3-point
rating scale.

2. Responses as to household income may be
reasonably accurate but are rarely totally precise.

All variables used in multivariate techniques must be
assumed to have some degree of measurement error.

Validity


Validity is the extent to which a measure correctly
represents the concept of study, that is, the
degree to which it is free from any systematic
or nonrandom error.

Validity relates to what should be
measured, not to how it is measured.

Reliability
Reliability is the degree to which the observed
variable measures the “true” value and is
“error free”.
 More reliable measures will show greater
consistency than less reliable measures.
 Choose the variable with the higher
reliability.
 Validity is concerned with how well the
concept is defined by the measures, whereas
reliability relates to the consistency of the
measures.


Multivariate measurement


Use of two or more variables as indicators of a single
composite measure, that is, each single variable is used in
conjunction with one or more other variables to form
the composite.



For e.g. A personality test may provide the answers
to a series of individual questions, which are then
combined to form a single score representing the
personality trait.
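A minimal sketch of such a composite measure follows. Averaging the items is an assumption made here for illustration; the slides only say the answers are "combined". The function name `composite_score` and the responses are hypothetical.

```python
def composite_score(item_responses):
    """Average several indicator items into one composite score (assumed rule)."""
    if not item_responses:
        raise ValueError("need at least one item response")
    return sum(item_responses) / len(item_responses)

# Hypothetical answers of one respondent to four 5-point items.
answers = [4, 5, 3, 4]
print(composite_score(answers))  # 16 / 4 = 4.0
```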

Statistical Significance


Most multivariate techniques are based on the
statistical inference of a population's values from a randomly
drawn sample of that population.

In interpreting statistical inferences, the researcher must
specify the acceptable levels of statistical error.

A 5% or 1% level of significance means the researcher accepts a
5% or 1% probability of rejecting the null hypothesis when it is
actually true.

Types of statistical error
                 H0 is true          H0 is false
Accept H0        1-α                 β (Type 2 error)
Reject H0        α (Type 1 error)    1-β (Power)

There are two types of errors:
• Type 1 error
• Type 2 error

A Type 1 error is rejecting
the null hypothesis when it is true. Its probability
is alpha (α), also
known as the producer’s risk.

A Type 2 error is failing to
reject the null hypothesis when it is false. Its
probability is beta (β), also known as the
consumer’s risk.

Statistical power
 Power is the probability of correctly rejecting the null
hypothesis when it is false.


Power of statistical inference test is 1-β



Increased sample sizes produce greater power of the
statistical test.



Researchers should always design the study to
achieve a power level of 0.80 at the desired
significance level.
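To illustrate how power (1-β) grows with sample size, here is a sketch for one simple case that is an assumption of this example, not something the slides specify: a one-sided, one-sample z-test with known standard deviation. The function name `z_test_power` and the effect size are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided, one-sample z-test with known sigma."""
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    shift = effect * sqrt(n) / sigma            # mean of the test statistic under H1
    return 1 - std_normal.cdf(critical_z - shift)

# Power rises as the sample size increases (effect and sigma are illustrative).
for n in (10, 25, 50, 100):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

Under these assumptions, the smallest sample size whose power reaches 0.80 can be found by increasing n until the function returns at least 0.80.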

What type of relationship is being examined?

• Dependence: How many variables are being predicted?
  ▫ Multiple relationships of dependent and independent variables:
    Structural equation modeling
  ▫ Several dependent variables in a single relationship:
    What is the measurement scale of the dependent variable?
      Metric: What is the measurement scale of the predictor variable?
        Metric: Canonical correlation analysis
        Nonmetric: Multivariate analysis of variance
      Nonmetric: Canonical correlation analysis with dummy variables
  ▫ One dependent variable in a single relationship:
    What is the measurement scale of the dependent variable?
      Metric: 1. Multiple regression 2. Conjoint analysis
      Nonmetric: 1. Multiple discriminant analysis 2. Linear probability models

• Interdependence: Is the structure of relationships among:
  ▫ Variables: Factor analysis, Confirmatory factor analysis
  ▫ Cases/Respondents: Cluster analysis
  ▫ Objects: How are the attributes measured?
      Metric: Multidimensional scaling
      Nonmetric: Correspondence analysis

Discriminant analysis
• Discriminant analysis is used when the
dependent variable is a non metric variable and
the independent variables are metric variables.

• Total sample can be divided into groups based
on a qualitative dependent variable.

Discriminant Analysis
• Discriminant analysis also refers to a wider family of
techniques

▫ Still for discrete response, continuous predictors

▫ Produces discriminant functions that classify
observations into groups
 These can be linear or quadratic functions
 Can also be based on non-parametric techniques

Objectives
• To understand group differences and to correctly classify
objects into groups or classes.
• For example, it can be used to distinguish innovators from noninnovators
according to their demographic and psychographic profiles,
males from females, or good credit risks
from poor credit risks.

Application of discriminant analysis
• Stage 1: Research Problem
  Select objectives:
  ▫ Evaluate group differences on a multivariate profile
  ▫ Classify observations into groups
  ▫ Identify dimensions of discrimination between groups
• Stage 2: Research Design Issues
  ▫ Selection of independent variables
  ▫ Sample size considerations
  ▫ Creation of analysis and holdout samples
• Stage 3: Assumptions
  ▫ Normality of independent variables
  ▫ Linearity of relationships
  ▫ Lack of multicollinearity among independent variables
  ▫ Equal dispersion matrices
• Stage 4: Estimation of the Discriminant Functions
  ▫ Simultaneous or stepwise estimation
  ▫ Significance of discriminant functions
  Assess Predictive Accuracy with Classification Matrices
  ▫ Determine optimal cutting score
  ▫ Specify criterion for assessing hit ratio
  ▫ Statistical significance of predictive accuracy
• Stage 5: Interpretation of the Discriminant Functions
  How many functions will be interpreted?
  ▫ One: Evaluation of single function
    (discriminant weights, discriminant loadings, partial F values)
  ▫ Two or more: Evaluation of separate functions
    (discriminant weights, partial F values),
    then evaluation of combined functions
    (rotation of functions, potency index, graphical display of
    group centroids, graphical display of loadings)
• Stage 6: Validation of Discriminant Results
  ▫ Split-sample or cross-validation
  ▫ Profiling group differences
Stage 2: Selecting Dependent and Independent Variables
• The researcher must first specify which variables are to be independent
(they must be metric) and which variable is to be the
dependent variable (it must be nonmetric).
• The groups of the dependent variable must be
mutually exclusive and exhaustive.
• The dependent variable categories should be distinct and
unique on the independent variables. Otherwise the analysis will
not be able to uniquely profile each group, resulting in
poor explanation and classification.

Converting metric variables
• In some situations the dependent variable is not truly
categorical. We may use an ordinal or interval
measurement as a categorical dependent variable by
creating artificial groups.
• Consider using only the extreme groups to maximize the group
differences.
• This method is called the polar extremes approach.

The Independent Variables
• Independent variables are selected in two ways
▫ Identifying variables either from previous research
or from the theoretical model.
▫ Utilizing the researcher’s knowledge.
• In both instances, independent variables must
identify differences between at least two groups of the
dependent variable.

Sample Size
• Overall sample size
▫ Aim for 20 cases per independent variable, with a minimum
recommended level of 5 observations per variable.
• Sample size per category
▫ The smallest group must have more observations than the
number of independent variables.
▫ Wide variations in group sizes will affect the
estimation of the discriminant function and the
classification of observations.
• Maintain a sufficient sample size both overall and for
each group.

Division of the sample
• A common practice in discriminant analysis is to divide the sample into two
subsamples:
▫ One (the analysis sample) for estimation of the
discriminant function
▫ Another (the holdout sample) for validation purposes
• The sample may be divided 50-50, 60-40, or 75-25,
depending on the overall sample size.

• If the categorical groups are equally represented in
the total sample, the analysis and holdout
samples should be of approximately equal size (e.g., if the sample
consists of 50 males and 50 females, the analysis
and holdout samples would each have 25 males and 25
females).

• If the original groups are unequal, the group sizes in the
analysis and holdout samples should be
proportionate to the total sample distribution (e.g., if the
sample contains 70 females and 30 males,
the analysis and holdout samples would each consist
of 35 females and 15 males).
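The proportionate division described above can be sketched as a stratified random split, in which each group is split separately so that both samples keep the group proportions. The function name `stratified_split`, the random seed, and the index-based sample are hypothetical choices for this illustration.

```python
import random

def stratified_split(groups, analysis_fraction=0.5, seed=0):
    """Split each group separately so both samples keep the group proportions.

    groups maps a group label to the list of its observations.
    """
    rng = random.Random(seed)
    analysis, holdout = {}, {}
    for label, members in groups.items():
        shuffled = members[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * analysis_fraction)
        analysis[label] = shuffled[:cut]
        holdout[label] = shuffled[cut:]
    return analysis, holdout

# Hypothetical sample: 70 females and 30 males, identified by index.
sample = {"female": list(range(70)), "male": list(range(70, 100))}
analysis, holdout = stratified_split(sample, analysis_fraction=0.5)
print({k: len(v) for k, v in analysis.items()})  # {'female': 35, 'male': 15}
print({k: len(v) for k, v in holdout.items()})   # {'female': 35, 'male': 15}
```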

Stage 3: Assumptions
• The most important assumption is the equality of the
covariance matrices, which affects both estimation
and classification.
Impact On Estimation
• If the sample sizes are small and the covariance
matrices are unequal, the estimation process is
adversely affected.
• Data that do not follow the normality assumption can cause
problems in the estimation.

Impact On Classification
• Unequal covariance matrices also affect the
classification process.
• The effect can be minimized by increasing the
sample size and by using the group-specific
covariance matrices for classification purposes.
• Multicollinearity among the independent variables
can reduce the estimated impact of independent
variables in the derived discriminant functions,
particularly if a stepwise estimation process is
used.

Stage 4: Estimation of the Model
• The first task in deriving the discriminant function is to choose the
estimation method:
▫ 1. The simultaneous (direct) method
▫ 2. The stepwise method
1. Simultaneous method:
It involves computing the discriminant function so that
all of the independent variables are considered at the
same time.

2. Stepwise method:
It involves entering the independent variables into the
discriminant function one at a time on the basis of
their discriminating power.
Step 1: Choose the single best discriminating variable.
Step 2: Pair the initial variable with each of the other
independent variables, one at a time, and select the
variable that best improves the discriminating power of the
function in combination with the first variable. Select
additional variables in a like manner.
Step 3: Variables that are not useful in discriminating
between the groups are eliminated.
Stepwise estimation becomes less stable when the sample
size declines below the recommended level of 20
observations per independent variable.

Statistical significance
• The researcher must assess the level of significance of the
discriminant function. The conventional criterion
of 0.05 or beyond is often used. A researcher willing to accept a
higher level of risk may set a significance level of 0.2 or 0.3.
Overall significance:
1. Simultaneous estimation: The measures of Wilks’
lambda, Hotelling’s trace, and Pillai’s criterion all evaluate the
statistical significance of the discriminant function.
2. Stepwise estimation: If a stepwise method is used to estimate the discriminant
function, the Mahalanobis D² and Rao’s V measures are most
appropriate.
Mahalanobis D² is based on generalized squared Euclidean distance that adjusts
for unequal variances.

Assessing overall model fit
• This assessment involves three tasks
▫ Calculating discriminant Z scores for each
observation
▫ Evaluating group differences on the discriminant Z
scores
▫ Assessing group membership prediction accuracy

Calculating Discriminant Z scores
• The discriminant function can be expressed with
either standardized or unstandardized weights and
values.
• The standardized version is more useful for
interpretation purpose.
• The unstandardized version is easier to use in
calculating the discriminant Z score.

Evaluating Group Differences
• Group differences are assessed by comparing the group
centroids, the average discriminant Z score for all
group members.
• The differences between centroids are measured in
terms of the Mahalanobis D² measure.

Assessing Group Membership Prediction
Accuracy
• Because the dependent variable is nonmetric, it is not possible
to use a measure such as R² to assess predictive accuracy.
• Rather, each observation must be assessed. In doing so,
several major considerations must be addressed:
▫ Developing classification matrices
▫ Cutting score calculation
▫ Construction of the classification matrices
▫ Standards for assessing classification accuracy

Classification matrix
• This is also called a prediction matrix.
• The correctly classified cases appear on the diagonal because
the predicted and actual groups are the same.
• The off-diagonal cells represent cases that have been incorrectly
classified.
• The sum of the diagonal elements divided by the total number of cases
is the hit ratio.
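The hit ratio calculation can be sketched as follows. The 2x2 matrix is hypothetical, with rows as actual groups and columns as predicted groups (an assumed layout), and the function name `hit_ratio` is illustrative.

```python
def hit_ratio(matrix):
    """Share of correctly classified cases: diagonal sum / total cases."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Hypothetical classification matrix: rows = actual group, columns = predicted group.
matrix = [
    [22, 3],   # actual group A: 22 correct, 3 misclassified as B
    [5, 20],   # actual group B: 5 misclassified as A, 20 correct
]
print(hit_ratio(matrix))  # 42 / 50 = 0.84
```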

Cutting score
• Criterion against which each individual's discriminant Z
score is compared to determine predicted group
membership.

• It represents the dividing point used to classify observation
into one of two groups based on their discriminant function
score.
• Optimal cutting score:
Discriminant Z score value that best separates the groups on
each discriminant function for classification purposes.

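Optimal Cutting Score with
Equal Sample Sizes
[Figure: distributions of Group A and Group B with centroids Z̄A and Z̄B;
regions labeled "Classify as A (Nonpurchaser)" and "Classify as B (Purchaser)".]

Optimal Cutting Score with
Unequal Sample Sizes
[Figure: distributions of Group A and Group B with centroids Z̄A and Z̄B,
showing the optimal weighted cutting score and the unweighted cutting score.]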
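For two groups, the cutting score is commonly computed as the midpoint of the two centroids when group sizes are equal, and as a size-weighted value when they are unequal. The formulas below are the commonly cited ones and are an assumption here, since the slides show only the figures; the function names and numbers are hypothetical.

```python
def cutting_score(z_a, z_b, n_a=None, n_b=None):
    """Cutting score between two group centroids on the discriminant function.

    Without group sizes: the midpoint of the centroids (equal group sizes).
    With group sizes: each centroid is weighted by the size of the other group.
    """
    if n_a is None or n_b is None:
        return (z_a + z_b) / 2
    return (n_a * z_b + n_b * z_a) / (n_a + n_b)

def classify(z, cut, z_a, z_b):
    """Assign an observation to the group whose centroid lies on its side of the cut."""
    low, high = ("A", "B") if z_a < z_b else ("B", "A")
    return low if z < cut else high

# Hypothetical centroids and group sizes.
z_a, z_b = -1.0, 1.5
print(cutting_score(z_a, z_b))                  # (-1.0 + 1.5) / 2 = 0.25
print(cutting_score(z_a, z_b, n_a=70, n_b=30))  # (70*1.5 + 30*(-1.0)) / 100 = 0.75
print(classify(0.5, cutting_score(z_a, z_b), z_a, z_b))  # B
```

With unequal sizes the weighted score moves toward the centroid of the smaller group, so more of the score range is assigned to the larger group.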

Stage 5: Interpretation of the results
There are three methods of determining the relative importance of
each independent variable:
• Standardized discriminant weights (coefficients)
• Discriminant loadings (structure correlations)
• Partial F values

Discriminant weights (coefficients)
• This approach to interpreting discriminant functions examines the
sign and magnitude of the coefficient assigned to each variable in
computing the discriminant functions.
• Independent variables with larger coefficients
contribute more to the discriminating power of the
function than variables with smaller coefficients.
• The interpretation of discriminant coefficients is
similar to the interpretation of beta coefficients in
regression analysis.

Discriminant loadings
• Loadings are also referred to as structure correlations.
• Loadings are increasingly used as a basis for interpretation
because of the deficiencies in utilizing coefficients.
• Unique characteristic: loadings can be calculated for all
variables, whether they were used in the estimation of the
discriminant function or not. This is particularly useful in
stepwise estimation.
• Loadings are considered more valid than coefficients for interpreting
the discriminating power of independent variables because of
their correlational nature.
• Loadings exceeding ±.40 are considered substantive for
interpretation purposes.

Partial F values
• When the stepwise method is selected, partial F
values can be used to interpret the discriminating power of the
independent variables.
• Larger F values indicate greater discriminatory power.

Validation
• Discriminant loadings are the preferred method to
assess the contribution of each variable to a
discriminant function because they are:
▫ A standardized measure of importance (ranging
from 0 to 1 in absolute value)
▫ Available for all independent variables whether used
in the estimation process or not
▫ Unaffected by multicollinearity
• The discriminant function must be validated either
with a holdout sample or with one of the “leave one out”
procedures.

Cluster Analysis
• A statistical classification technique in
which cases, data, or objects (events, people, things, etc.)
are subdivided into groups (clusters) such that
the items in a cluster are very similar (but not
identical) to one another and very different from the
items in other clusters.

Application of Cluster analysis
• Stage 1: Research Problem
  Select objectives:
  ▫ Taxonomy description
  ▫ Data simplification
  ▫ Reveal relationships
  Select clustering variables
• Stage 2: Research Design Issues
  ▫ Can outliers be detected?
  ▫ Should the data be standardized?
  Select a Similarity Measure
  Are the cluster variables metric or nonmetric?
  ▫ Nonmetric data: association measures of similarity (matching coefficients)
  ▫ Metric data: Is the focus on pattern or proximity?
    Proximity: distance measures of similarity
    (Euclidean distance, city-block distance, Mahalanobis distance)
    Pattern: correlation measure of similarity (correlation coefficient)
  Standardization Options
  ▫ Standardizing variables
  ▫ Standardizing by observation
• Stage 3: Assumptions
  ▫ Is the sample representative of the population?
  ▫ Is multicollinearity substantial enough to affect results?
• Stage 4: Selecting a Clustering Algorithm
  Is a hierarchical, nonhierarchical, or combination of the two methods used?
  ▫ Hierarchical methods. Linkage methods available:
    single linkage, complete linkage, average linkage,
    Ward’s method, centroid method
  ▫ Nonhierarchical methods. Assignment methods available:
    sequential threshold, parallel threshold, optimization;
    selecting seed points
  ▫ Combination: use a hierarchical method to specify
    cluster seed points for a nonhierarchical method
  How many clusters are formed?
  ▫ Examine increases in agglomeration coefficient
  ▫ Examine dendrogram and vertical icicle plots
  ▫ Conceptual considerations
  Cluster analysis respecification
  ▫ Were any observations deleted as outliers or as
    members of small clusters? (Yes / No)
• Stage 5: Interpreting the Clusters
  ▫ Examine cluster centroids
  ▫ Name clusters based on clustering variables
• Stage 6: Validating and Profiling the Clusters
  ▫ Validation with selected outcome variables
  ▫ Profiling with additional descriptive variables
Governing principle

Maximization of homogeneity within clusters
and simultaneously Maximization of heterogeneity
across clusters

Three Basic Questions:
1. How to measure similarity?
2. How to form clusters? (extraction method)
3. How many clusters?

Answers to First Two Basic Questions:
1. How to measure similarity?
• Distance: squared Euclidean.
2. How to form clusters?
• Hierarchical: Ward’s method.

Third Basic Question: How many clusters?
1. Run the cluster analysis; examine solutions for
two, three, four, etc. clusters.
2. Select the number of clusters based on “a
priori” criteria, practical
judgment, common sense, theoretical
foundations, and statistical significance.

Steps in Cluster Analysis:
1. Identify the variables to be clustered.
2. Determine if clusters exist. To do so, verify the
clusters are statistically different and theoretically
meaningful.

3. Make an initial decision on how many clusters to use.
4. Where possible, validate clusters using an external
variable.

5. Describe the characteristics of the derived clusters
using demographics, psychographics, etc.

Objectives of cluster analysis
• The goal of cluster analysis is to partition a set of objects into
two or more groups based on their similarity.
• Cluster analysis is used for:
▫ Taxonomy description: identifying natural groups within
the data.
▫ Data simplification: the ability to analyze groups of
similar observations instead of all individual observations.
▫ Relationship identification: the simplified structure from
cluster analysis portrays relationships not revealed
otherwise.

Research Design in Cluster
Analysis
• Outliers
• Similarity/distance measures
• Standardizing the data

Outliers
• In a set of numbers, a number that is much larger or
much smaller than the rest of the numbers is called an
outlier.
• Outliers are values that “lie outside” the other values.

Similarity measure
• Three different forms of similarity measures are:
▫ Correlation measures (require metric data):
they have widespread application, but represent patterns rather
than proximity.
▫ Distance measures (require metric data):
they best represent the concept of proximity, which is
fundamental to cluster analysis.
▫ Association measures (require nonmetric data):
they represent the proximity of objects across a set of nonmetric
variables.

Types of distance measures
 Euclidean distance
 Squared (or absolute) Euclidean distance
 City – block (Manhattan) distance
 Chebychev distance

 Mahalanobis distance (D2 )

Euclidean distance
[Figure: points A (x1, y1) and B (x2, y2) plotted on X and Y axes, with
horizontal leg x2 - x1 and vertical leg y2 - y1.]
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

• Squared (or absolute) Euclidean distance
▫ It is the sum of the squared differences without taking the
square root.
▫ It is the distance measure for the centroid and Ward’s
methods of clustering.
• City-block distance
▫ Uses the sum of the absolute differences of the variables,
i.e., the two sides of a right triangle rather than the
hypotenuse.
▫ Simplest to calculate, but may lead to invalid clusters if
the clustering variables are highly correlated.

• Chebychev distance
▫ The maximum of the absolute differences across the
variables. It is particularly
susceptible to differences in scales across the
variables.
• Mahalanobis distance (D²)
▫ A generalized distance measure that accounts for the
correlations among the variables, so that each
variable is weighted equally.
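The simpler distance measures can be sketched in a few lines. This is an illustration with hypothetical points; Mahalanobis distance is left out because it also needs the covariance matrix of the variables.

```python
from math import sqrt

def squared_euclidean(a, b):
    """Sum of squared differences, without the square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance (the hypotenuse)."""
    return sqrt(squared_euclidean(a, b))

def city_block(a, b):
    """Sum of absolute differences (the two legs of the right triangle)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical points A and B.
a, b = (1.0, 2.0), (4.0, 6.0)
print(euclidean(a, b))          # 5.0
print(squared_euclidean(a, b))  # 25.0
print(city_block(a, b))         # 7.0
print(chebyshev(a, b))          # 4.0
```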

Standardizing the data
• Cluster analyses using distance measures are quite
sensitive to differing scales or magnitudes among the
variables.
• With unstandardized data, distance measures change when the
scale of the variables is changed.

Standardizing the variables
• The most common form of standardization is the conversion of
each variable to standard scores (i.e., Z scores)
by subtracting the mean and dividing by the standard
deviation for each variable.
• It eliminates the effects due to scale differences not only
across variables, but for the same variable as well.
• Clustering variables should be standardized whenever
possible to avoid problems resulting from the use of
different scale values among clustering variables.
• A measure of Euclidean distance that directly incorporates
a standardization procedure is the Mahalanobis distance
(D²).
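The Z-score conversion can be sketched as below. The variables and their values are hypothetical, and the sample (n - 1) standard deviation is an assumption of this example.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert one variable to Z scores: (value - mean) / standard deviation."""
    m, s = mean(values), stdev(values)  # stdev uses the sample (n - 1) formula
    return [(v - m) / s for v in values]

# Hypothetical variables measured on very different scales.
income = [30000, 45000, 60000, 75000, 90000]
age = [25, 35, 45, 55, 65]
print([round(z, 3) for z in standardize(income)])
print([round(z, 3) for z in standardize(age)])
```

After standardization both variables have mean 0 and standard deviation 1, so neither dominates a distance calculation simply because of its units.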

Standardizing by observation
• Standardizing by respondent standardizes each
question not to the sample’s average but instead to that
respondent’s average score.
• This within-case or row-centering standardization can
be quite effective in removing response-style effects.
• Standardization provides a remedy to a fundamental
issue in similarity measures, particularly distance measures.

Cluster Analysis Assumptions:

Representative Sample.
• The cluster analysis is only as good as the
representativeness of the sample. Therefore, all
efforts should be made to ensure that the sample is
representative and the results are generalizable to
the population.

Minimal Multicollinearity.
• Input variables should be examined for strong
multicollinearity and, if present:
▫ Reduce the variables to equal numbers in each set
of correlated measures, or
▫ Use a distance measure that compensates for the
correlation, such as Mahalanobis distance.

Deriving Clusters and Assessing Overall
Fit
• With the clustering variables selected and the
similarity matrix calculated, the partitioning process
begins.
• The researcher must:
▫ Select the partitioning procedure used for forming
clusters.
▫ Decide on the number of clusters to be
formed.

Partitioning Procedures
• The goal is to maximize the differences between clusters relative to
the variation within the clusters.
• The most widely used procedures can be classified as:
▫ Hierarchical
▫ Nonhierarchical

[Figure: types of clustering procedures, with panels labeled Hierarchical
(Agglomerative, Divisive), Non-hierarchical, Non-overlapping, and Overlapping.]

Hierarchical Clustering
• Two main types of hierarchical clustering:
▫ Agglomerative:
  Start with the points as individual clusters.
  At each step, merge the closest pair of clusters until only one
  cluster (or k clusters) is left.
▫ Divisive:
  Start with one, all-inclusive cluster.
  At each step, split a cluster until each cluster contains a single point
  (or there are k clusters).
• Traditional hierarchical algorithms use a similarity or distance matrix.
▫ Merge or split one cluster at a time.
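The agglomerative procedure can be sketched in plain Python. Single linkage and Euclidean distance are assumptions chosen for brevity, the data points are hypothetical, and the brute-force search is meant for illustration rather than for large data sets.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def single_linkage(c1, c2):
    """Smallest distance between any point in c1 and any point in c2."""
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerate(points, k):
    """Merge the closest pair of clusters until only k clusters are left."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters

# Hypothetical two-dimensional observations.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
for cluster in agglomerate(points, k=3):
    print(cluster)
```

Stopping at k=1 and recording the distance at each merge would give the information a dendrogram displays.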

Agglomerative
[Figure: six observations (OBS 1 to OBS 6) displayed against a distance measure
axis running from 0.2 to 1.0, with directions labeled Agglomerative and Divisive.]
Step 0: Each observation is treated as a separate cluster.
Step 1: The two observations with the smallest pairwise distance are clustered (Cluster 1).
Step 2: Two other observations with the smallest distance among the remaining
points/clusters are clustered (Cluster 2).
Step 3: Observation 3 joins Cluster 1.
Step 4: Clusters 1 and 2 from Step 3 join into a “supercluster”. A single
observation remains unclustered (outlier).

Five Most Popular Agglomerative Algorithms
• Single linkage: smallest distance from any object in one cluster to
any object in the other.
• Complete linkage: largest distance between an observation in one
cluster and an observation in the other.
• Average linkage: average distance between an observation in one
cluster and an observation in the other.
• Centroid method: distance between the centroids of two clusters.
• Ward’s method: the distance between two clusters is the difference between the
within-cluster sum of squares resulting from merging the two clusters
and the total within-cluster sum of squares for the two clusters separately.
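The five between-cluster distance rules listed above can be written as small functions. This is a sketch assuming Euclidean distance and hypothetical clusters; the function names are illustrative.

```python
from math import dist
from statistics import mean

def single_link(c1, c2):
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2):
    return max(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2):
    return mean(dist(p, q) for p in c1 for q in c2)

def centroid(cluster):
    """Mean of the cluster on each variable."""
    return tuple(mean(coord) for coord in zip(*cluster))

def centroid_link(c1, c2):
    return dist(centroid(c1), centroid(c2))

def within_ss(cluster):
    """Within-cluster sum of squared distances to the cluster centroid."""
    c = centroid(cluster)
    return sum(dist(p, c) ** 2 for p in cluster)

def ward_link(c1, c2):
    """Increase in within-cluster sum of squares caused by merging c1 and c2."""
    return within_ss(c1 + c2) - (within_ss(c1) + within_ss(c2))

# Hypothetical clusters of two observations each.
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(4.0, 0.0), (4.0, 2.0)]
for rule in (single_link, complete_link, average_link, centroid_link, ward_link):
    print(rule.__name__, round(rule(c1, c2), 3))
```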

Agglomerative Algorithms
[Figure: five panels illustrating the linkage criteria.]
• Single linkage: minimum distance
• Complete linkage: maximum distance
• Average linkage: average distance
• Centroid method: distance between centres
• Ward’s method: minimization of within-cluster variance

Single linkage
Minimize shortest distance from cluster to point
[Figure: points A, B, C, D, E, G, H with candidate distances 7.0 and 8.5.]

Complete linkage
Minimize longest distance from cluster to point
[Figure: points A, B, C, D, E, G, H with candidate distances 10.5 and 9.5.]

Average linkage
Minimize average distance from cluster to point
[Figure: points A, B, C, D, E, G, H with candidate distances 8.5 and 9.0.]

Hierarchical Clustering: Comparison
[Figure: cluster solutions for the same six points under four methods:
MIN, MAX, Group Average, and Ward’s Method.]

Non Hierarchical Clustering
• Nonhierarchical clustering is often referred to as K-means
clustering.
• The process essentially has two steps:
▫ Specify cluster seeds:
identify starting points for each cluster, known as cluster
seeds. They are often selected at random.
▫ Assignment:
assign each observation to one of the cluster seeds based on
similarity.
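The two steps can be sketched as a basic K-means loop. This is an illustration under stated assumptions: seeds are drawn at random from the sample, similarity is Euclidean distance, and centers are updated to the cluster means until they stop moving. The function name, seed value, and data are hypothetical.

```python
import random
from math import dist
from statistics import mean

def k_means(points, k, iterations=20, seed=0):
    """Basic K-means: random seeds from the sample, then assign and update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # sample-generated cluster seeds
    for _ in range(iterations):
        # assignment step: each observation goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # update step: move each center to the mean of its cluster
        new_centers = [
            tuple(mean(c) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical observations forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```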

Selecting seed points
• How do we select the cluster seeds?
The approaches fall into two basic categories:
• Researcher specified:
▫ The researcher provides the seed points based
on external data. The two common sources of
the seed points are prior research or data from
another multivariate analysis.
• Sample generated:
▫ The cluster seeds are generated from the
observations of the sample, by either
systematic or random selection.


Non Hierarchical Clustering Algorithms
• Sequential threshold method: first determine a
cluster center, then group all objects that are within a
predetermined threshold from the center; one cluster is
created at a time.
• Parallel threshold method: several
cluster centers are determined simultaneously, then objects that are
within a predetermined threshold from the centers are
grouped.
• Optimizing partitioning method: first a nonhierarchical
procedure is run, then objects are
reassigned so as to optimize an overall criterion.

Pros and Cons of Hierarchical Methods
• Hierarchical clustering is the more popular approach, particularly with Ward’s
method and average linkage.
• Advantages:
▫ Simplicity
▫ Measures of similarity
▫ Speed

Disadvantages
• Hierarchical methods can be misleading because
undesirable early combinations may persist throughout
the analysis and lead to artificial results.
• To reduce the impact of outliers, the researcher may need to analyze
the data several times, each time deleting problem
observations or outliers.

Combination of Both Methods
• A hierarchical technique is used to generate a complete
set of cluster solutions, establish the applicable cluster solutions,
profile cluster centers to act as cluster seed points, and
identify outliers.
• After outliers are eliminated, the remaining observations
can be clustered by a nonhierarchical method with the
cluster centers from the hierarchical results acting as
the initial seed points.

Decision on the number of clusters to be
formed
• A critical issue in performing either a hierarchical or nonhierarchical
cluster analysis is determining the number of clusters.
• The decision is critical for hierarchical techniques because
the researcher must select the cluster solution that best
represents the data structure (using a stopping rule).

Interpretation of the clusters
• The cluster centroid (i.e., the mean profile of the cluster) is
particularly useful in interpretation.
• Interpretation involves examining the distinguishing characteristics of
each cluster and identifying differences between
clusters.
• If a cluster solution fails to show large variation between clusters, then
other cluster solutions should be examined.
• The cluster centroids should be assessed based on theory
or practical experience.

Validation
 Validation is essential in cluster analysis because the
clusters are descriptive of structure and require additional
support for their relevance.
 Cross-validation
 Cluster analyze separate samples
 Compare the cluster solutions
 Assess the correspondence of the results
 When separate samples are impractical, a common approach is
to create two subsamples (by randomly splitting the sample) and
then compare the two cluster solutions for consistency with
respect to the number of clusters and the cluster profiles.

Inferring Gene Functionality
 Researchers want to know the functions of new genes
 Simply comparing the new gene sequences to known DNA
sequences often does not give away the actual function of
the gene
 For 40% of sequenced genes, functionality cannot be
ascertained by only comparing to sequences of other known
genes
 Microarrays allow biologists to infer gene function even
when there is not enough evidence to infer function based
on similarity alone

Microarray Analysis
 Microarrays measure the activity (expression level)
of the gene under varying conditions/time points
 Expression level is estimated by measuring the
amount of mRNA for that particular gene
 A gene is active if it is being transcribed
 More mRNA usually indicates more gene
activity

Microarray Experiments
 Analyze mRNA produced from cells in the tissue with the
environmental conditions you are testing
 Produce cDNA from mRNA (DNA is more stable)
 Attach phosphor to cDNA to see when a particular gene is
expressed
 Different color phosphors are available to compare many
samples at once
 Hybridize cDNA over the microarray
 Scan the microarray with a phosphor-illuminating laser
 Illumination reveals transcribed genes
 Scan the microarray multiple times for the different color
phosphors

Using Microarrays
• Track the sample over a period of time to see gene
expression over time
• Track two different samples under the same conditions to
see the difference in gene expression
[Figure: grid of boxes; each box represents one gene’s
expression over time]

Using Microarrays (cont’d)
 Green: expressed only in the control cells
 Red: expressed only in the experimental cells
 Yellow: equally expressed in both samples
 Black: NOT expressed in either control or experimental
cells

Microarray Data
 Microarray data are usually transformed into an
intensity matrix (below)
 The intensity matrix allows biologists to make
correlations between different genes (even if they are
dissimilar) and to understand how gene functions
might be related
 Clustering comes into play

Intensity (expression level) of gene at measured time:

Time:     Time X   Time Y   Time Z
Gene 1    10       8        10
Gene 2    10       0        9
Gene 3    4        8.6      3
Gene 4    7        8        3
Gene 5    1        2        3

Clustering of Microarray Data
 Plot each datum as a point in N-dimensional space
 Make a distance matrix for the distance between every
two gene points in the N-dimensional space
 Genes with a small distance share the same expression
characteristics and might be functionally related or
similar
 Clustering reveals groups of functionally related genes

Hierarchical clustering
Step 1: Transform the genes × experiments matrix into a
genes × genes distance matrix

          Exp 1   Exp 2   Exp 3   Exp 4
Gene A
Gene B
Gene C

          Gene A   Gene B   Gene C
Gene A    0
Gene B    ?        0
Gene C    ?        ?        0

Step 2: Cluster genes based on the distance matrix and draw
a dendrogram until a single node remains

Data and distance matrix

Data:

Genes    Patients
         A      B      C      D      E
1        90     190    90     200    150
2        190    390    110    400    200

Distance matrix between patients:

     A       B       C       D       E
A    0.0
B    223.6   0.0
C    80.0    297.3   0.0
D    237.1   14.1    310.2   0.0
E    60.8    194.2   108.2   206.2   0.0
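
The distances on this slide are consistent with the Euclidean distance between each pair of patients, treating the two gene values as coordinates. The following is a minimal sketch that recomputes them; the names `patients`, `euclidean` and `distance_matrix` are illustrative and not part of the original slides.

```python
import math

# Expression values of gene 1 and gene 2 for patients A-E (data from the slide).
patients = {
    "A": (90, 190),
    "B": (190, 390),
    "C": (90, 110),
    "D": (200, 400),
    "E": (150, 200),
}


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_matrix(points):
    """Return a dict mapping (row, column) name pairs to distances."""
    names = list(points)
    return {(r, c): euclidean(points[r], points[c]) for r in names for c in names}


dist = distance_matrix(patients)

# Print the lower triangle, rounded to one decimal as on the slide.
names = list(patients)
for i, r in enumerate(names):
    print(r, [round(dist[(r, c)], 1) for c in names[: i + 1]])
```

Rounded to one decimal, the printed rows match the distance matrix on the slide, for example 223.6 between A and B and 14.1 between B and D.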

Hierarchical clustering (continued)

Initial distance matrix:

       G1    G2    G3    G4    G5
G1     0
G2     2     0
G3     6     5     0
G4     10    9     4     0
G5     9     8     5     3     0

After merging G1 and G2:

         G (12)   G3    G4    G5
G (12)   0
G3       6        0
G4       10       4     0
G5       9        5     3     0

After merging G4 and G5:

         G (12)   G3    G (45)
G (12)   0
G3       6        0
G (45)   10       5     0

Stage    Groups
P5       [1], [2], [3], [4], [5]
P4       [1 2], [3], [4], [5]
P3       [1 2], [3], [4 5]
P2       [1 2], [3 4 5]
P1       [1 2 3 4 5]

[Figure: dendrogram of genes 1 to 5]
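
The merged distances on this slide (for example 6 between G (12) and G3, the larger of 6 and 5) match the complete-linkage rule, where the distance between two clusters is the largest distance between their members. The sketch below assumes that rule and reproduces the stages P5 to P1; the function names are illustrative only.

```python
from itertools import combinations

# Pairwise distances between genes G1..G5 (initial matrix from the slide).
D = {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 5, (2, 4): 9, (2, 5): 8,
    (3, 4): 4, (3, 5): 5,
    (4, 5): 3,
}


def point_distance(i, j):
    """Look up the distance between two genes regardless of order."""
    return D[(min(i, j), max(i, j))]


def complete_linkage(a, b):
    """Cluster distance = largest distance between any two members."""
    return max(point_distance(i, j) for i in a for j in b)


def agglomerate(items):
    """Merge the two closest clusters repeatedly until one remains.

    Returns a list of (merge_height, clusters) pairs, starting with the
    unmerged state (height None).
    """
    clusters = [(i,) for i in items]
    history = [(None, list(clusters))]
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: complete_linkage(*pair))
        height = complete_linkage(a, b)
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        clusters.sort()
        history.append((height, list(clusters)))
    return history


history = agglomerate([1, 2, 3, 4, 5])
for height, clusters in history:
    print(height, clusters)
```

The merge heights are the values that would be drawn on the vertical axis of the dendrogram.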

[Figure: Clustering of Microarray Data (cont’d), showing the resulting clusters]
[Figure: Hierarchical Clustering: Example]

Factor analysis is an interdependence technique that
analyzes the correlations among a large number of
variables in order to summarize the information they
contain with a minimum loss of information.
In factor analysis, we group variables by their
correlations, such that variables in a group (factor) have
high correlations with each other.

• Research problem
▫ Is the analysis exploratory or confirmatory?
▫ Select objectives:
 Data summarization
 Data reduction
• Confirmatory
▫ Structural equation modeling
• Exploratory
▫ Select the type of Factor Analysis
What is being grouped – variables or cases?
• Cases
▫ Q – type factor analysis or cluster analysis.

• Variables
▫ R – type factor analysis.

• Research design
▫ What variables are included?
▫ How are the variables measured?
▫ What is the desired sample size?

• Assumptions
▫ Statistical considerations of normality, linearity, and
homoscedasticity.
▫ Homogeneity of sample
▫ Conceptual linkages

• Selecting a Factor Method
▫ Is the total variance or only common variance
analyzed?

• Total variance
▫ Extract factors with component analysis

• Common variance
▫ Extract factors with common factor analysis

• Specifying the Factor Matrix
▫ Determine the number of factors to be retained

• Selecting a Rotational Method
▫ Should the factors be correlated (oblique) or uncorrelated (orthogonal)?
• Orthogonal Methods
▫ VARIMAX
▫ EQUIMAX
▫ QUARTIMAX
• Oblique Methods
▫ Oblimin
▫ Promax
▫ Orthoblique
• Interpreting the Rotated Factor Matrix
▫ Can significant loadings be found?
▫ Can factors be named?
▫ Are communalities sufficient?
 If no, return to selecting a factor method
 If yes, go to factor model respecification

• Factor model respecification
▫ Were any variables deleted?
▫ Do you want to change the number of factors?
▫ Do you want another type of rotation?
 If yes, return to selecting a factor method
 If no, go to validation of the factor matrix

• Validation of the Factor Matrix
▫ Split/multiple samples
▫ Separate analysis for subgroups
▫ Identify influential cases

• Additional Uses
▫ Selection of Surrogate Variables
▫ Computation of Factor Scores
▫ Creation of Summated Scales

• Factor – summarizes the original set of observed
variables.
• Factor loadings – the correlations between the original
variables and the factors.

• Squared factor loadings – the percentage of the
variance in an original variable that is explained by a
factor.

Communality

• In factor analysis, a measure of the percentage of
a variable’s variation that is explained by the
factors.
• A relatively high communality indicates that a
variable has much in common with the other
variables taken as a group.
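
For orthogonal (uncorrelated) factors, the communality of a variable can be computed as the sum of its squared factor loadings. A minimal sketch follows; the loadings are hypothetical values chosen for illustration and do not come from the slides.

```python
# Hypothetical loadings of four variables on two orthogonal factors.
loadings = {
    "V1": (0.80, 0.10),
    "V2": (0.70, 0.30),
    "V3": (0.20, 0.90),
    "V4": (0.30, 0.40),
}


def communality(row):
    """Sum of squared loadings: share of the variable's variance explained."""
    return sum(value ** 2 for value in row)


communalities = {name: communality(row) for name, row in loadings.items()}

for name, h2 in communalities.items():
    print(f"{name}: communality = {h2:.2f}")
```

In this made-up example V4 has a low communality, so the two factors explain little of its variance.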

Specifying the Unit of Analysis
• First select the unit of analysis for factor analysis
▫ Variables (or)
▫ Respondents
• To summarize variables, factor analysis is applied to a
correlation matrix of the variables.
o R factor analysis – the common type of factor analysis, used for
variables
• Factor analysis may also be applied to a correlation matrix of
the individual respondents based on their characteristics.
o Q factor analysis – used for respondents
o Q factor analysis is not used frequently because it is difficult to
calculate.
o Instead, some type of cluster analysis is used to group individual
respondents.

Data summarization
• Explain the data in a smaller number of concepts
that equally represent the original set of variables.

Variable selection
• Whether factor analysis is used for data reduction
or summarization, the researcher should consider the
conceptual basis of the variables.
• In assessing the dimensions of store image, if no
question on store workers were included, factor
analysis would not be able to identify this
dimension.

• Factor analysis is subject to the “garbage in, garbage out”
phenomenon.
• If the researcher includes a large number of
variables and hopes that factor analysis will
“figure it out,” then there is a high possibility of
poor results.

Designing a factor analysis
• Factor analysis involves three basic decisions:
▫ Correlations among variables or respondents
▫ Variable selection and measurement issues
▫ Sample size

Correlations among variables or
respondents
• There are two forms of factor analysis. Both use a correlation
matrix as the basic data input.
▫ R type – uses a traditional correlation matrix of the variables
as input.
▫ Q type – uses a correlation matrix of the respondents; the
resulting factor matrix identifies similar individuals.
Q factor analysis is different from cluster analysis.
Q-type factor analysis forms groupings based on the
intercorrelations between the respondents.
Cluster analysis forms groupings based on a distance-based
similarity measure.

Variable selection and measurement
issues
• Two specific questions must be answered:
▫ What type of variables can be used in factor analysis?
▫ How many variables should be included?

• Correlations are easy to compute for metric variables.
• Non metric variables are more problematic.
• If dummy variables (coded 0-1) are defined to represent
the categories of non metric variables, correlations can
then be computed.
• Boolean factor analysis is more appropriate if all the
variables are dummy variables.
• If a study is being designed to reveal factor structure, strive
to have at least five variables for each proposed factor.

Sample size
• For sample size:
▫ The sample must have more observations than
variables.
▫ The minimum absolute sample size should be 50
observations.

• Maximize the number of observations per
variable, with a minimum of 5 and preferably at least 10
observations per variable.

Need of factor analysis
• The difficulties in having too many
independent variables in predicting the
response variable are:
▫ Increased computational time to get a solution
▫ Increased time in data collection
▫ Too much expenditure in data collection
▫ Presence of redundant independent variables
▫ Difficulty in making inferences

• These can be avoided using Factor Analysis

• Factor analysis aims at grouping the original input
variables into factors which underlie the input variables

• Initially, the total number of factors equals the total number
of input variables
• After performing factor analysis, the total number of factors
in the study can be reduced by dropping the insignificant
factors based on certain criteria

Objective of factor analysis
• The main objective of factor analysis is to
summarize a large number of variables into a
smaller number of factors which represent the
basic dimensions underlying the data.
• Factor analysis is used to uncover the latent
structure (dimensions) of a set of variables.
• It reduces attribute space from a larger number of variables
to a smaller number of factors and as such is a
“nondependent” procedure (that is, it does not assume a
dependent variable is specified).

Assumptions
• Factor analysis is designed for interval data, although it
can also be used for ordinal data
• The variables used in factor analysis should be linearly
related to each other. This can be checked by looking at
scatter plots of pairs of variables.
• The variables must also be at least moderately
correlated with each other; otherwise the number of factors
will be almost the same as the number of original
variables, which means that carrying out a factor analysis
would be pointless.

Method of determining the
appropriateness of factor analysis
• If inspection of the correlation matrix reveals no substantial
number of correlations greater than 0.30, then factor analysis
is probably inappropriate.
• The correlations among variables can also be analyzed by
computing the partial correlations among variables.
• If partial correlations are high, then factor analysis is
inappropriate.
• Partial correlations should be small, because each variable
can be explained by the variables loading on the factors.

• Bartlett test of sphericity:
▫ It is a statistical test for the presence of correlations among the
variables.
▫ A statistically significant Bartlett’s test of sphericity (sig. < 0.05)
indicates that sufficient correlations exist among the variables to
proceed.
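
As a rough illustration of Bartlett's test, the sketch below computes the usual chi-square statistic, -[(n - 1) - (2p + 5)/6] ln|R|, with p(p - 1)/2 degrees of freedom, where R is the correlation matrix, p the number of variables and n the sample size. The correlation matrix and sample size are hypothetical, and the helper names are not from the slides.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def bartlett_sphericity(corr, n_obs):
    """Return (chi-square statistic, degrees of freedom)."""
    p = len(corr)
    statistic = -((n_obs - 1) - (2 * p + 5) / 6) * math.log(determinant(corr))
    dof = p * (p - 1) // 2
    return statistic, dof


# Hypothetical correlation matrix of three variables, 100 observations.
R = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
chi2, dof = bartlett_sphericity(R, n_obs=100)

# 7.815 is the 0.05 critical value of chi-square with 3 degrees of freedom.
print(f"chi-square = {chi2:.2f}, df = {dof}, significant = {chi2 > 7.815}")
```

A statistic above the critical value means the correlation matrix differs significantly from an identity matrix, so factoring the variables is reasonable.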

• Measure of sampling adequacy (MSA):
o MSA value must exceed 0.50 for both the overall test and each
individual variable.
o Variables with values less than 0.50 should be omitted from the
factor analysis one at a time, with the smallest one being omitted
each time.

• The MSA increases as:
o The sample size increases
o The average correlations increase
o The number of variables increases
o The number of factors decreases
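
One common way to compute the MSA is the Kaiser-Meyer-Olkin formula, which compares squared correlations with squared partial correlations; small partial correlations push the value toward 1. The sketch below assumes that formula, obtains the partial correlations from the inverse of the correlation matrix, and uses a hypothetical three-variable matrix.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(matrix)
    a = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        scale = a[col][col]
        a[col] = [x / scale for x in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]


def msa(corr):
    """Return (overall MSA, list of per-variable MSA values)."""
    n = len(corr)
    inv = invert(corr)
    # Partial correlation of i and j, controlling for all other variables.
    partial = [[-inv[i][j] / (inv[i][i] * inv[j][j]) ** 0.5 for j in range(n)]
               for i in range(n)]
    per_variable = []
    r_total = p_total = 0.0
    for i in range(n):
        r2 = sum(corr[i][j] ** 2 for j in range(n) if j != i)
        p2 = sum(partial[i][j] ** 2 for j in range(n) if j != i)
        per_variable.append(r2 / (r2 + p2))
        r_total += r2
        p_total += p2
    return r_total / (r_total + p_total), per_variable


# Hypothetical correlation matrix of three variables.
R = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
overall, per_variable = msa(R)
print(f"overall MSA = {overall:.3f}")
print("per-variable MSA:", [round(v, 3) for v in per_variable])
```

Under the rule on the slide, any variable whose value falls below 0.50 would be a candidate for removal, starting with the smallest.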

Selecting a Factor extraction method
• Before selecting a method of factor extraction, the researcher
must have some understanding of the variance of a variable
and how it is divided or partitioned.

• For the purpose of factor analysis, it is important to
understand how much of a variable’s variance is shared with
other variables.

• The total variance of any variable can be divided into three
types of variance.
▫ Common variance: variance in a variable that is shared
with all other variables in the analysis.
▫ Specific variance (unique variance): variance associated
with only a specific variable. This variance cannot be
explained by the correlations with the other variables.
▫ Error variance: variance that also cannot be
explained by correlations with other variables.
 As a variable becomes more highly correlated with one or
more variables, the common variance (communality)
increases.
 If unreliable measures or other sources of error variance are
introduced, the common variance is reduced.

Factor analysis Vs principal component
analysis

Factor analysis
• It analyzes only the variance
shared among the variables
(common variance without
error or unique variance).
• It adjusts the diagonals of the
correlation matrix with the
unique factors.
• The observed variables in FA are
linear combinations of the
underlying and unique factors.
• FA underlying constructs can be
labeled and readily interpreted,
given an accurate model
specification.

Principal component analysis
• It analyzes total variance.
• It inserts 1’s on the diagonals of
the correlation matrix.
• The component scores in PCA
represent a linear combination
of the observed variables
weighted by eigenvectors.
• PCA components do not represent
underlying constructs.

• Both models yield similar
results if the number of variables
exceeds 30 or the
communalities exceed 0.60.

Number of factors to extract
• Any decision on the number of factors to be
retained should be based on several considerations:
▫ Factors with eigenvalues greater than 1.0
▫ A predetermined number of factors based on
research objectives and prior research
▫ Enough factors to meet a specified percentage of
variance explained, usually 60% or higher.
▫ Factors shown by the scree test to have substantial
amounts of common variance.
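
Two of these criteria are simple to apply once the eigenvalues are known. The sketch below uses a hypothetical set of six eigenvalues (they sum to 6, as they would for a correlation matrix of six variables) and illustrative function names.

```python
# Hypothetical eigenvalues for six variables, largest first.
eigenvalues = [2.8, 1.4, 0.8, 0.5, 0.3, 0.2]


def latent_root_count(values, cutoff=1.0):
    """Number of factors whose eigenvalue exceeds the cutoff."""
    return sum(1 for v in values if v > cutoff)


def variance_count(values, target=0.60):
    """Smallest number of factors whose cumulative share of the
    total variance reaches the target proportion."""
    total = sum(values)
    cumulative = 0.0
    for k, v in enumerate(sorted(values, reverse=True), start=1):
        cumulative += v
        if cumulative / total >= target:
            return k
    return len(values)


print("eigenvalue > 1 rule:", latent_root_count(eigenvalues))
print("60% of variance rule:", variance_count(eigenvalues))
```

When the criteria disagree, the researcher still has to weigh them against the research objectives and the scree test.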

[Figure: Scree plot of eigenvalue (0.0 to 3.0) against
component number (1 to 6)]

Interpreting the factors
• The three processes of factor interpretation
▫ Estimate the factor matrix
 First unrotated factor matrix is computed, containing
the factor loadings for each variable on each factor.

▫ Factor rotation
▫ Factor interpretation and respecification

Rotating factors
• When the factors are extracted, factor loadings are
obtained. Factor loadings are the correlations between each
variable and the factor. When the factors are rotated, the
variance is redistributed so that the factor
loading pattern and the percentage of variance for each of
the factors are different.
• The objective of rotation is to redistribute the variance
from earlier factors to later ones to achieve a
simpler, theoretically more meaningful factor
pattern and make the results easier to interpret.
• Two types of factor rotation
1. Orthogonal factor rotation
2. Oblique factor rotation

Orthogonal factor rotation
• In orthogonal rotation the axes are maintained at right
angles. The objective of all methods of rotation is to
simplify the rows and columns of the factor matrix.
• Simplifying the rows means making as many values in
each row as close to zero as possible.
• Simplifying the columns means making as many values in
each column as close to zero as possible.
There are three orthogonal rotation methods
• 1) Quartimax,
• 2) Varimax,
• 3) Equimax.

[Figure: Orthogonal factor rotation. Variables V1 to V5 plotted
against unrotated factors I and II and rotated factors I and II;
both axes run from -1.0 to +1.0]

1. The quartimax rotation simplifies the rows of a factor
matrix, i.e. it focuses on rotating the initial factors so that a
variable loads high on one factor and as low as possible on
all other factors.
2. The varimax rotation simplifies the columns of the
factor matrix. With this approach, the maximum possible
simplification is reached if there are only 1’s and 0’s in a
column.
3. The equimax rotation is a compromise between the
quartimax and varimax.
In practice, the first two are the most commonly applied.
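
With only two factors, an orthogonal rotation is a single rotation angle, so the varimax idea can be illustrated by brute force. The sketch below uses hypothetical loadings and the raw varimax criterion (the variance of the squared loadings in each column, summed over factors, without Kaiser normalization); statistical packages use an iterative algorithm instead of a grid search.

```python
import math

# Hypothetical unrotated loadings of five variables on two factors.
unrotated = [
    (0.70, 0.50),
    (0.65, 0.55),
    (0.60, -0.50),
    (0.70, -0.45),
    (0.55, -0.60),
]


def rotate(loadings, angle):
    """Rotate both factor axes by `angle` radians, keeping them at right angles."""
    c, s = math.cos(angle), math.sin(angle)
    return [(a * c + b * s, -a * s + b * c) for a, b in loadings]


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings in that column."""
    total = 0.0
    for column in zip(*loadings):
        squares = [x ** 2 for x in column]
        mean = sum(squares) / len(squares)
        total += sum((q - mean) ** 2 for q in squares) / len(squares)
    return total


# Grid search over 0 to 90 degrees in steps of 0.1 degree.
angles = [math.radians(d / 10) for d in range(0, 900)]
best_angle = max(angles, key=lambda t: varimax_criterion(rotate(unrotated, t)))
rotated = rotate(unrotated, best_angle)

print(f"best angle: {math.degrees(best_angle):.1f} degrees")
for before, after in zip(unrotated, rotated):
    print(before, "->", tuple(round(x, 2) for x in after))
```

The rotation changes how each variable's variance is split between the two factors, but the communality of each variable (the row sum of squared loadings) stays the same.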

Oblique rotation methods
• Oblique rotations are similar to orthogonal rotations.
• They allow correlated factors instead of maintaining
independence between the rotated factors.
• In oblique rotation the axes need not be maintained at
right angles.
• Oblique rotation represents the clustering of variables more
accurately.
• There are three oblique rotation methods
▫ Oblimin
▫ Promax
▫ Orthoblique

[Figure: Oblique factor rotation. Variables V1 to V5 plotted
against unrotated factors I and II, orthogonally rotated
factors I and II, and obliquely rotated factors I and II;
both axes run from -1.0 to +1.0]

Assessing factor analysis
• In interpreting factors, a decision must be made regarding
the factor loadings worth consideration and attention.

• Loadings exceeding 0.70 are considered indicative of
well-defined structure and are the goal of any factor analysis.

Interpreting a factor matrix
• Step 1: Examine the factor matrix of loadings.
▫ The factor loading matrix contains the factor loading of
each variable on each factor.
▫ Rotated loadings are usually used in factor interpretation
unless data reduction is the sole objective.
▫ If an oblique rotation has been used, two matrices of factor
loadings are provided:
 Factor pattern matrix
 Factor structure matrix

 Factor pattern matrix
 Represents the unique contribution of each variable to the
factor.

 Factor structure matrix
 Contains the simple correlations between variables and
factors, but these loadings contain both the unique variance
between variables and factors and the correlation among factors.
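
The two matrices are linked through the correlations among the factors: the structure matrix equals the pattern matrix multiplied by the factor correlation matrix. The sketch below illustrates this with hypothetical pattern loadings and a hypothetical factor correlation of 0.4; none of the numbers come from the slides.

```python
# Hypothetical pattern loadings: three variables on two oblique factors.
pattern = [
    [0.80, 0.05],
    [0.70, 0.10],
    [0.10, 0.75],
]

# Hypothetical correlation matrix of the two factors.
phi = [
    [1.0, 0.4],
    [0.4, 1.0],
]


def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


structure = matmul(pattern, phi)

for p_row, s_row in zip(pattern, structure):
    print("pattern", p_row, "-> structure", [round(v, 2) for v in s_row])
```

When the factors are uncorrelated the factor correlation matrix is the identity, and the two matrices coincide.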

Validation of factor analysis
• Validation assesses the degree of generalizability of the
results to the population and the potential influence of
individual cases on the overall results.
• Use of a confirmatory perspective:
▫ The direct method of validating the results is to use a
confirmatory perspective.
▫ Assess the replicability of the results either with a split
sample in the original data set or with a separate sample.

Assessing factor structure stability
• Factor stability is primarily dependent on the sample size
and on the number of cases per variable.

• If the sample is randomly split into two subsets and a factor
model is estimated for each, comparison of the two resulting
factor matrices will provide an assessment of the robustness
of the solution.

Detecting influential observations
• Another issue in the validation of factor
analysis is the detection of influential observations.

• Estimate the model with and without
observations identified as outliers to assess their
impact on the results.

Additional uses of factor analysis results
• If the objective is
▫ to identify logical combinations of variables and better
understand the interrelationships among variables, then
factor interpretation will be enough.
▫ to identify appropriate variables for subsequent application in
other statistical techniques, then some form of data reduction
will be used.
• There are three options for data reduction
 Summated scale
 Surrogate variable
 Factor score

Summated Scales
• One of the common uses of factor analysis is the formation
of summated scales, where we add the scores on all the
variables loading on a component to create the score for
the component.
• To verify that the variables for a component are measuring
similar entities that are legitimate to add together, we
compute Cronbach's alpha.

• If Cronbach's alpha is 0.70 or greater (0.60 or greater for
exploratory research), we have support for the internal
consistency of the items, justifying their use in a summated
scale.
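
Cronbach's alpha can be computed from the item variances and the variance of the summated scale: alpha = k/(k - 1) × (1 - sum of item variances / variance of the total), where k is the number of items. A minimal sketch with hypothetical responses from five respondents on three items:

```python
from statistics import variance

# Hypothetical responses: rows are respondents, columns are three items
# assumed to load on the same component.
scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]


def summated_scale(rows):
    """Add the item scores of each respondent."""
    return [sum(row) for row in rows]


def cronbach_alpha(rows):
    """Cronbach's alpha from sample variances of the items and of the total."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance(summated_scale(rows))
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


alpha = cronbach_alpha(scores)
print("summated scale:", summated_scale(scores))
print(f"Cronbach's alpha = {alpha:.3f}")
```

With these made-up responses alpha is well above the 0.70 threshold, so adding the three items together would be supported.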

Surrogate variable
• One option is to examine the factor matrix and
select the variable with the highest factor loading
on each factor to act as a surrogate variable.
• The selection process is more difficult when two or
more variables have loadings that are significant and
close to each other.
• Disadvantages of selecting a single surrogate variable
▫ It does not address the issue of measurement error.
▫ It also runs the risk of potentially misleading results by
selecting only a single variable to represent a more
complex result.
• In such cases, a summated scale or factor
scores can be calculated instead of using a surrogate variable.

Factor Score
• A number that represents each observation’s calculated value on
each factor in a factor analysis.
• At the initial stage, the respondents assign scores to the
variables. After performing factor analysis, each respondent
receives a score on each factor. Such scores are called
respondent factor scores.
• Factor scores are standardized to have a mean of ‘0’ and a
standard deviation of ‘1’.
