Multivariate Analysis

Analysis of multiple variables in a
single relationship or set of
relationships.

Some basic concepts of multivariate analysis
 The variate
 Measurement scales
 Measurement error and multivariate measurement
 Statistical significance vs. statistical power

1. The Variate
The variate is also called a linear combination: a linear combination of variables with empirically determined weights.
Variate value = w1x1 + w2x2 + ... + wnxn
x1, x2, ..., xn = observed variables
w1, w2, ..., wn = weights
 The variables are specified by the researcher.
 The weights are determined by the multivariate technique.
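As a concrete illustration (a hypothetical three-variable example, not from the text), the variate value is just a weighted sum of the observed variables:

```python
import numpy as np

# Hypothetical observed values for one respondent on three variables
x = np.array([4.0, 2.5, 7.0])   # x1, x2, x3
# Weights, empirically determined by the multivariate technique
w = np.array([0.6, -0.2, 0.4])  # w1, w2, w3

variate_value = np.dot(w, x)    # w1*x1 + w2*x2 + w3*x3
print(variate_value)            # 2.4 - 0.5 + 2.8 = 4.7
```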

2.Measurement Scales
Data analysis involves the
identification and measurement of
variation in a set of variables.
 The researcher cannot identify
variation unless it can be measured.
 Measurement is important both for representing the concept and for selecting the appropriate multivariate method of analysis.

Measurement scales can be classified into two categories:
 Non metric (qualitative)
 Metric (quantitative)
Non metric measurement scales
Also called qualitative data. Measures that describe a subject by indicating the presence or absence of a characteristic or property are called non metric data.
For e.g.: If a person is male, he cannot be female. An "amount" of gender is not possible, only a state of being male or female.
Qualitative measurements can be made with either a nominal or an ordinal scale.
Nominal Scales
A nominal scale assigns numbers to identify subjects or objects.
 Nominal scales are also known as categorical scales.
For e.g., in representing gender the researcher might assign a number to each category, i.e. assign 2 for females and 1 for males.

Ordinal Scales
 In ordinal scales, variables can be ordered or ranked.
 Every subject or object can be compared with another in terms of a "greater than" or "less than" relationship.
 Ordinal scales provide only the order of the values, not a measure of the actual amount.
For e.g. opinion scales that go from "most important" to "least important" or "strongly agree" to "strongly disagree".
 Using an ordinal scale we cannot find the amount of the differences between products, i.e. we cannot answer the question whether the difference between Product A and B is greater than the difference between Product B and C.
Metric Measurement Scales
Metric data are also called quantitative data. These measurements identify subjects not only by the possession of an attribute but also by the amount to which the subject may be characterized by the attribute.
For e.g. age, height.
 The two metric measurement scales are
 Interval scales
 Ratio scales

Interval and ratio scales
Both consist of real numbers; ratio scales additionally have an absolute zero point, so their values can be divided and compared as ratios.
For e.g. income, weight, height and age.
 With these scales we can answer the question whether the difference between Product A and B is greater than the difference between Product B and C.

MEASUREMENT ERROR AND MULTIVARIATE MEASUREMENT
Measurement Error
Not measuring the "true" variable values accurately, due to inappropriate response scales, data entry errors, or respondent errors.
For e.g.
1. Imposing 7-point rating scales for attribute measurement when the researcher knows the respondents can accurately respond only to a 3-point rating.
2. Responses as to household income may be reasonably accurate but rarely totally precise.
All variables used in multivariate techniques must be assumed to have some degree of measurement error.
Validity
Validity is the extent to which a measure correctly represents the concept of study, i.e. the degree to which it is free from any systematic or nonrandom error.
 Validity relates not to what should be measured, but instead to how it is measured.
Reliability
Reliability is the degree to which the observed
variable measures the “true” value and is
“error free”.
 More reliable measures will show greater
consistency than less reliable measures.
 Choose the variable with the higher
reliability.
 Validity is concerned with how well the
concept is defined by the measures, whereas
reliability relates to the consistency of the
measures.

Multivariate measurement
Use of two or more variables as indicators (i.e. single variables used in conjunction with one or more other variables to form a composite measure) of a single composite measure.
For e.g. a personality test may provide the answers to a series of individual questions, which are then combined to form a single score representing the personality trait.
Statistical Significance
 All multivariate techniques are based on statistical inference about a population from a randomly drawn sample of that population.
 In interpreting statistical inferences, the researcher must specify the acceptable level of statistical error.
 A 5% level of significance means we are 95% certain that our sample results are not due to chance (a 1% level corresponds to 99% certainty).
Types of statistical error

              H0 is true          H0 is false
Accept H0     1-α (correct)       β (Type 2 error)
Reject H0     α (Type 1 error)    1-β (Power)

There are two types of errors: Type 1 error and Type 2 error.
 A Type 1 error is the probability of rejecting the null hypothesis when it is true. It is also known as producer's risk. The probability of a Type 1 error is alpha (α).
 A Type 2 error is the probability of failing to reject the null hypothesis when it is false. It is also known as consumer's risk. The probability of a Type 2 error is beta (β).
Statistical power
 Power is the probability of correctly rejecting the null hypothesis when it is false.
 The power of a statistical inference test is 1-β.
 Increased sample sizes produce greater power of the statistical test.
 Researchers should design the study to achieve a power level of 0.80 at the desired significance level.
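A quick sketch of such a power calculation using statsmodels, assuming a two-sample t test; the 0.80 power target and the 0.05 significance level come from the text, while the medium effect size of 0.5 is a hypothetical choice:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed for power = 0.80 at alpha = 0.05,
# assuming a medium standardized effect size of 0.5
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"required n per group: {n:.1f}")   # roughly 64 per group

# Conversely, the power achieved with a given sample size
power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)
print(f"power with n=30:      {power:.2f}")
```

The same solver illustrates the text's point that larger samples produce greater power: raising nobs1 raises the computed power.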
Selecting a multivariate technique

What type of relationship is being examined?

Dependence — how many variables are being predicted?
 One dependent variable in a single relationship — what is the measurement scale of the dependent variable?
  Metric: 1. Multiple regression; 2. Conjoint analysis
  Non metric: 1. Multiple discriminant analysis; 2. Linear probability models
 Several dependent variables in a single relationship — what is the measurement scale of the dependent variable?
  Metric — what is the measurement scale of the predictor variable?
   Metric: Canonical correlation analysis
   Non metric: Multivariate analysis of variance
  Non metric: Canonical correlation analysis with dummy variables
 Multiple relationships of dependent and independent variables: Structural equation modeling

Interdependence — is the structure of relationships among:
 Variables: Factor analysis; confirmatory factor analysis
 Cases/Respondents: Cluster analysis
 Objects — how are the attributes measured?
  Metric: Multidimensional scaling
  Non metric: Correspondence analysis
Discriminant analysis
• Discriminant analysis is used when the
dependent variable is a non metric variable and
the independent variables are metric variables.

• The total sample can be divided into groups based on a qualitative dependent variable.
Discriminant Analysis
• Discriminant analysis also refers to a wider family of
techniques

▫ Still for discrete response, continuous predictors

▫ Produces discriminant functions that classify
observations into groups
 These can be linear or quadratic functions
 Can also be based on non-parametric techniques
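A minimal sketch with scikit-learn (synthetic data, not from the text): the class labels play the role of the non metric dependent variable and the two columns of X the metric independents; both linear and quadratic discriminant functions are shown.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Two groups with different means on two metric predictors
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)        # nonmetric dependent variable

lda = LinearDiscriminantAnalysis().fit(X, y)     # linear function
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # quadratic function

print(lda.coef_, lda.intercept_)   # discriminant weights
print(lda.predict(X[:5]))          # classify observations into groups
```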
Objectives
• To understand group differences and correctly classify objects into groups or classes.
• For example, discriminant analysis can distinguish innovators from non-innovators according to their demographic and psychographic profiles, males from females, or good credit risks from poor credit risks.
Application of discriminant analysis

Stage 1: Research problem. Select objectives:
 Evaluate group differences on a multivariate profile
 Classify observations into groups
 Identify dimensions of discrimination between groups

Stage 2: Research design issues:
 Selection of independent variables
 Sample size considerations
 Creation of analysis and holdout samples

Stage 3: Assumptions:
 Normality of independent variables
 Linearity of relationships
 Lack of multicollinearity among independent variables
 Equal dispersion matrices

Stage 4: Estimation of the discriminant functions:
 Simultaneous or stepwise estimation
 Significance of discriminant functions
Assess predictive accuracy with classification matrices:
 Determine the optimal cutting score
 Specify the criterion for assessing the hit ratio
 Statistical significance of predictive accuracy

Stage 5: Interpretation of the discriminant functions — how many functions will be interpreted?
 One: evaluation of a single function (discriminant weights, discriminant loadings, partial F values)
 Two or more: evaluation of separate functions (discriminant weights, discriminant loadings, partial F values) and evaluation of the combined functions (rotation of functions, potency index, graphical display of group centroids, graphical display of loadings)

Stage 6: Validation of discriminant results:
 Split-sample or cross-validation
 Profiling group differences
Stage 2: Selecting Dependent and Independent Variables
• First specify which variables are to be independent (must be metric) and which variable is to be the dependent variable (must be non metric).
• The groups of the dependent variable must be mutually exclusive and exhaustive.
• Dependent variable categories should be distinct and unique on the independent variables; otherwise discriminant analysis will not be able to uniquely profile each group, resulting in poor explanation and classification.
Converting metric variables
• In some situations the dependent variable is not truly categorical. We may use an ordinal or interval measurement as a categorical dependent variable by creating artificial groups.
• Consider using extreme groups to maximize the group difference.
• This method is called the polar extremes approach.
The Independent Variables
• Independent variables are selected in two ways
▫ Identifying variables either from previous research
or from the theoretical model.
▫ Utilizing the researcher’s knowledge.
• In both instances, independent variables must
identify differences between at least two groups of the
dependent variable.
Sample Size
• Overall sample size
▫ Have 20 cases per independent variable, with a minimum recommended level of 5 observations per variable.
• Sample size per category
▫ The smallest group size of a category must exceed the number of independent variables.
▫ Wide variations in group sizes will impact the estimation of the discriminant function and the classification of observations.
• Maintain a sufficient sample size both overall and for each group.
Division of the sample
• In discriminant analysis the sample is divided into two subsamples:
▫ One (the analysis sample) for estimation of the discriminant function
▫ Another (the holdout sample) for validation purposes.
 The sample is divided 50-50, 60-40, or 75-25 depending on the overall sample size.
 If the categorical groups are equally represented in the total sample, the analysis and holdout samples are of approximately equal size (i.e. if the sample consists of 50 males and 50 females, the analysis and holdout samples would each have 25 males and 25 females).
 If the original groups are unequal, the sizes of the analysis and holdout samples should be proportionate to the total sample distribution (i.e. if the sample contained 70 females and 30 males, the analysis and holdout samples would each consist of 35 females and 15 males).
Stage 3: Assumptions
• The most important assumption is the equality of the covariance matrices, which affects both estimation and classification.
Impact on estimation:
• If the sample sizes are small and the covariance matrices are unequal, the estimation process is affected.
• Data that do not follow the normality assumption can also cause problems in the estimation.
Impact on Classification
• Unequal covariance matrices affect the classification process.
• The effect can be minimized by increasing the sample size and by using the group-specific covariance matrices for classification purposes.
• Multicollinearity among the independent variables can reduce the estimated impact of independent variables in the derived discriminant functions, particularly if a stepwise estimation process is used.
Stage 4: Estimation of the Model
• The first step in deriving the discriminant function is to choose the estimation method:
▫ 1. The simultaneous (direct) method
▫ 2. The stepwise method
• 1. Simultaneous method:
It involves computing the discriminant function so that all of the independent variables are considered at the same time.
2. Stepwise Method:
It involves entering the independent variables into the discriminant function one at a time on the basis of their discriminating power.
Step 1 – Choose the single best discriminating variable.
Step 2 – Pair each remaining independent variable with the initial variable, one at a time, and select the variable that best improves the discriminating power of the function in combination with the first variable. Select additional variables in a like manner.
Step 3 – Variables that are not useful in discriminating between the groups are eliminated.
Stepwise estimation becomes less stable when the sample size declines below the recommended level of 20 observations per independent variable.
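scikit-learn has no built-in stepwise discriminant procedure, but the forward logic described above can be sketched as a greedy search; note that this uses cross-validated accuracy as a stand-in for the partial F criterion of the classical procedure, and the helper name is hypothetical:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def forward_stepwise(X, y, tol=0.01):
    """Greedy forward selection: add variables one at a time while
    cross-validated accuracy improves by at least `tol`."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: cross_val_score(LinearDiscriminantAnalysis(),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < tol:
            break                 # no remaining variable adds enough power
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

# Usage (with X an observations x variables array, y the group labels):
# selected, accuracy = forward_stepwise(X, y)
```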
Statistical significance
• The researcher must assess the level of significance of the discriminant function. Significance is conventionally tested at the 0.05 level or beyond; if a higher level of risk is acceptable, a significance level of 0.2 or 0.3 may be used.
Overall significance:
1. Simultaneous estimation: the measures of Wilks' lambda, Hotelling's trace, and Pillai's criterion all evaluate the statistical significance of the discriminant function.
2. Stepwise estimation: the Mahalanobis D² and Rao's V measures are most appropriate.
Mahalanobis D² is based on squared Euclidean distance that adjusts for unequal variances.
Assessing overall model fit
• This assessment involves three tasks
▫ Calculating discriminant Z scores for each
observation
▫ Evaluating group differences on the discriminant Z
scores
▫ Assessing group membership prediction accuracy
Calculating Discriminant Z scores
• The discriminant function can be expressed with
either standardized or unstandardized weights and
values.
• The standardized version is more useful for
interpretation purpose.
• The unstandardized version is easier to use in
calculating the discriminant Z score.
Evaluating Group Differences
• Group differences are evaluated by comparing the group centroids, the average discriminant Z score for all group members.
• The differences between centroids are measured in terms of the Mahalanobis D² measure.
Assessing Group Membership Prediction Accuracy
• Because the dependent variable is nonmetric, it is not possible to use a measure such as R² to assess predictive accuracy.
• Rather, each observation must be assessed as correctly or incorrectly classified. In doing so, several major considerations must be addressed:
▫ Developing classification matrices
▫ Cutting score calculation
▫ Construction of the classification matrices
▫ Standards for assessing classification accuracy.
Classification matrix
• This is also called the prediction matrix.
• The correctly classified cases appear on the diagonal, because the predicted and actual groups are the same.
• Off-diagonal cells represent cases that have been incorrectly classified.
• The sum of the diagonal elements divided by the number of cases represents the hit ratio.
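A sketch of the classification matrix and hit ratio with scikit-learn; y_true and y_pred are hypothetical actual and predicted group labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # actual groups
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])  # predicted groups

cm = confusion_matrix(y_true, y_pred)
print(cm)                       # diagonal = correctly classified cases

hit_ratio = np.trace(cm) / cm.sum()   # sum of diagonal / number of cases
print(f"hit ratio: {hit_ratio:.2f}")  # 6 of 8 correct = 0.75
```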
Cutting score
• The criterion against which each individual's discriminant Z score is compared to determine predicted group membership.
• It represents the dividing point used to classify observations into one of two groups based on their discriminant function score.
• Optimal cutting score:
The discriminant Z score value that best separates the groups on each discriminant function for classification purposes.
[Figure: Optimal cutting score with equal sample sizes — the cutting score lies midway between the two group centroids Z̄A and Z̄B; observations scoring below it are classified as A (nonpurchaser), above it as B (purchaser).]

[Figure: Optimal cutting score with unequal sample sizes — the optimal weighted cutting score lies between the centroids Z̄A and Z̄B but is shifted relative to the unweighted cutting score to account for the group sizes.]
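The two cutting scores sketched in the figures can be computed directly from the group centroids; the formulas follow the standard treatment (the centroid values and group sizes below are hypothetical):

```python
# Hypothetical group centroids (mean discriminant Z scores) and sizes
z_a, z_b = -1.2, 0.8      # centroids of groups A and B

# Equal group sizes: the cutting score is midway between the centroids
z_ce = (z_a + z_b) / 2
print(z_ce)               # -0.2

# Unequal group sizes: weight each centroid by the *other* group's size
n_a, n_b = 70, 30
z_cu = (n_a * z_b + n_b * z_a) / (n_a + n_b)
print(z_cu)               # (70*0.8 + 30*-1.2) / 100 = 0.2
```

Note how the weighted score shifts toward the larger group's centroid side, reflecting the unequal base rates.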
Stage 5: Interpretation of the results
Three methods of determining the relative importance of
each independent variable.
• Standardized discriminant weights(coefficients)
• Discriminant loadings ( structure correlations)
• Partial F values
Discriminant weights (coefficients)
• One approach to interpreting discriminant functions examines the sign and magnitude of the coefficient assigned to each variable in computing the discriminant functions.
• Independent variables with larger coefficients contribute more to the discriminating power of the function than variables with smaller coefficients.
• The interpretation of discriminant coefficients is similar to the interpretation of beta coefficients in regression analysis.
Discriminant loadings
• Also referred to as structure correlations.
• Loadings are increasingly used as a basis for interpretation because of the deficiencies in utilizing coefficients.
• Unique characteristic: loadings can be calculated for all variables, whether or not they were used in the estimation of the discriminant function; this is particularly useful in stepwise estimation.
• Loadings are more valid than coefficients for interpreting the discriminant power of independent variables because of their correlational nature.
• Loadings exceeding ±.40 are considered substantive for interpretation purposes.
Partial F values
• When the stepwise method is selected, partial F values can be used to interpret the discriminating power of the independent variables.
• Large F values indicate greater discriminatory power.
Validation
• Discriminant loadings are the preferred method to assess the contribution of each variable to a discriminant function because they are:
▫ A standardized measure of importance (ranging from 0 to 1)
▫ Available for all independent variables whether used in the estimation process or not
▫ Unaffected by multicollinearity
The discriminant function must be validated either with a holdout sample or with one of the "leave-one-out" procedures.
Cluster Analysis
 A statistical classification technique in which cases, data, or objects (events, people, things, etc.) are subdivided into groups (clusters) such that the items in a cluster are very similar (but not identical) to one another and very different from the items in other clusters.
Application of cluster analysis

Stage 1: Research problem. Select objectives:
 Taxonomy description
 Data simplification
 Reveal relationships
Select the clustering variables.

Stage 2: Research design issues:
 Can outliers be detected?
 Should the data be standardized?
Select a similarity measure — are the cluster variables metric or non metric?
 Non metric data: association measures of similarity (matching coefficients)
 Metric data — is the focus on pattern or proximity?
  Proximity: distance measures of similarity (Euclidean distance, city-block distance, Mahalanobis distance)
  Pattern: correlation measures of similarity (correlation coefficient)
Standardization options:
 Standardizing variables
 Standardizing by observation

Stage 3: Assumptions:
 Is the sample representative of the population?
 Is multicollinearity substantial enough to affect results?

Stage 4: Selecting a clustering algorithm — is a hierarchical, nonhierarchical, or combination of the two methods used?
 Hierarchical methods — linkage methods available: single linkage, complete linkage, average linkage, Ward's method, centroid method
 Nonhierarchical methods — assignment methods available: sequential threshold, parallel threshold, optimization; selecting seed points
 Combination: use a hierarchical method to specify cluster seed points for a nonhierarchical method
How many clusters are formed?
 Examine increases in the agglomeration coefficient
 Examine the dendrogram and vertical icicle plots
 Conceptual considerations
Cluster analysis respecification — were any observations deleted as outliers or as members of small clusters? If yes, return to Stage 4; if no, proceed to Stage 5.

Stage 5: Interpreting the clusters:
 Examine cluster centroids
 Name clusters based on the clustering variables

Stage 6: Validating and profiling the clusters:
 Validation with selected outcome variables
 Profiling with additional descriptive variables
Governing principle
 Maximization of homogeneity within clusters and, simultaneously, maximization of heterogeneity across clusters.
Three Basic Questions:
1. How to measure similarity?
2. How to form clusters? (extraction method)
3. How many clusters?

Answers to the first two basic questions:
1. How to measure similarity?
• Distance – squared Euclidean.
2. How to form clusters?
• Hierarchical – Ward's method.
Third Basic Question: How many clusters?
1. Run the cluster procedure and examine solutions for two, three, four, etc. clusters.
2. Select the number of clusters based on "a priori" criteria, practical judgment, common sense, theoretical foundations, and statistical significance.
Steps in Cluster Analysis:
1. Identify the variables to be clustered.
2. Determine if clusters exist. To do so, verify the clusters are statistically different and theoretically meaningful.
3. Make an initial decision on how many clusters to use.
4. Where possible, validate clusters using an external variable.
5. Describe the characteristics of the derived clusters using demographics, psychographics, etc.
Objectives of cluster analysis
 The goal of cluster analysis is to partition a set of objects into two or more groups based on their similarity.
 Cluster analysis is used for:
 Taxonomy description: identifying natural groups within the data.
 Data simplification: the ability to analyze groups of similar observations instead of all individual observations.
 Relationship identification: the simplified structure from cluster analysis portrays relationships not revealed otherwise.
Research Design in Cluster Analysis
• Outliers
• Similarity/distance measures
• Standardizing the data
Outliers
 In a set of numbers, a number that is much larger or much smaller than the rest of the numbers is called an outlier.
 Outliers are values that "lie outside" the other values.
Similarity measures
 Three different forms of similarity measures are:
 Correlation measures (require metric data): having widespread application, they represent patterns rather than proximity.
 Distance measures (require metric data): best represent the concept of proximity, which is fundamental to cluster analysis.
 Association measures (require nonmetric data): represent the proximity of objects across a set of nonmetric variables.
Types of distance measures
 Euclidean distance
 Squared (or absolute) Euclidean distance
 City-block (Manhattan) distance
 Chebychev distance
 Mahalanobis distance (D²)
Euclidean distance
[Figure: points A (x1, y1) and B (x2, y2) plotted in the X–Y plane; the Euclidean distance is the hypotenuse of the right triangle with legs x2-x1 and y2-y1.]

d = √((x2 − x1)² + (y2 − y1)²)
 Squared (or absolute) Euclidean distance
 The sum of the squared differences without taking the square root.
 It is the distance measure for the centroid and Ward's methods of clustering.
 City-block distance
 Uses the sum of the absolute differences of the variables, i.e. the two sides of a right triangle rather than the hypotenuse.
 Simplest to calculate, but may lead to invalid clusters if the clustering variables are highly correlated.
 Chebychev distance
 Another distance measure. It is particularly susceptible to differences in scales across the variables.
 Mahalanobis (or correlation) distance (D²)
 This measure adjusts for the correlations among the variables when computing the distance used to cluster the observations.
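The distance measures listed above are all available in scipy; a minimal sketch on two hypothetical observations (the Mahalanobis distance additionally needs the inverse covariance matrix, estimated here from a hypothetical sample):

```python
import numpy as np
from scipy.spatial import distance

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])

print(distance.euclidean(x, y))     # sqrt of sum of squared differences
print(distance.sqeuclidean(x, y))   # squared Euclidean
print(distance.cityblock(x, y))     # sum of absolute differences
print(distance.chebyshev(x, y))     # largest single difference

# Mahalanobis: VI is the inverse covariance matrix of the variables,
# estimated here from a small hypothetical sample
X = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```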
Standardizing the data
 Cluster analyses using distance measures are quite sensitive to differing scales or magnitudes among the variables.
 Distance measures computed on unstandardized data change whenever the scale of the variables changes.
Standardizing the variables
 The most common form of standardization is the conversion of each variable to standard scores (i.e. Z scores) by subtracting the mean and dividing by the standard deviation for each variable.
 It eliminates the effects due to scale differences, not only across variables but also for the same variable.
 Clustering variables should be standardized whenever possible to avoid problems resulting from the use of different scale values among clustering variables.
 A measure of Euclidean distance that directly incorporates a standardization procedure is the Mahalanobis distance (D²).
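Z-score standardization of the clustering variables is one line with scipy (each column below is a hypothetical clustering variable):

```python
import numpy as np
from scipy.stats import zscore

X = np.array([[170.0, 65000.0],      # e.g. height (cm), income
              [160.0, 42000.0],
              [182.0, 58000.0]])

Z = zscore(X, axis=0, ddof=1)        # (x - column mean) / column std
print(Z.mean(axis=0))                # ~0 for every variable
print(Z.std(axis=0, ddof=1))         # 1 for every variable
```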
Standardizing by observation
 Standardizing by respondent standardizes each question not to the sample's average but instead to that respondent's average score.
 Within-case (or row-centering) standardization can be quite effective in removing response-style effects.
 Standardization provides a remedy to a fundamental issue in similarity and distance measures.
Cluster Analysis Assumptions:
Representative sample.
• The cluster analysis is only as good as the representativeness of the sample. Therefore, all efforts should be made to ensure that the sample is representative and the results are generalizable to the population.
Minimal multicollinearity.
• Input variables should be examined for strong multicollinearity and, if present:
 • Reduce the variables to equal numbers in each set of correlated measures, or
 • Use a distance measure that compensates for the correlation, such as Mahalanobis distance.
Deriving Clusters and Assessing Overall Fit
 With the clustering variables selected and the similarity matrix calculated, the partitioning process begins.
 The researcher must:
 Select the partitioning procedure used for forming clusters.
 Decide on the number of clusters to be formed.
Partitioning Procedures
 The goal of partitioning is to maximize the differences between clusters relative to the variation within the clusters.
 The most widely used procedures can be classified as:
 Hierarchical
 Nonhierarchical
[Diagram: clustering procedures are either non-overlapping or overlapping; non-overlapping procedures divide into hierarchical (agglomerative or divisive) and non-hierarchical methods.]
Hierarchical Clustering
 Two main types of hierarchical clustering:
 Agglomerative:
  Start with the points as individual clusters.
  At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
 Divisive:
  Start with one, all-inclusive cluster.
  At each step, split a cluster until each cluster contains a point (or there are k clusters).
 Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
[Figures: agglomerative clustering of six observations (OBS 1–6), with the distance measure plotted on a 0.2–1.0 scale; read left to right the process is agglomerative, read right to left it is divisive.]
Step 0: Each observation is treated as a separate cluster.
Step 1: The two observations with the smallest pairwise distance are clustered (Cluster 1).
Step 2: The two other observations with the smallest distance among the remaining points/clusters are clustered (Cluster 2).
Step 3: Observation 3 joins Cluster 1.
Step 4: Clusters 1 and 2 join into a "supercluster"; a single observation remains unclustered (an outlier).
Five Most Common Agglomerative Algorithms
 Single linkage: smallest distance from any object in one cluster to any object in the other.
 Complete linkage: largest distance between an observation in one cluster and an observation in the other.
 Average linkage: average distance between an observation in one cluster and an observation in the other.
 Centroid method: distance between the centroids of two clusters.
 Ward's method: the distance between two clusters is the increase in the total within-cluster sum of squares that results from merging them.
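All five algorithms are available through scipy's linkage function; a sketch on hypothetical two-group data (scipy accepts raw observations or a condensed distance matrix):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # agglomeration schedule
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut tree at 2 clusters
    print(method, labels)

# dendrogram(linkage(X, method="ward")) would plot the merge tree
```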
Agglomerative Algorithms
[Figures: the algorithms illustrated on clusters of points — single linkage uses the minimum distance between clusters, complete linkage the maximum distance, average linkage the average distance, the centroid method the distance between cluster centres, and Ward's method the minimization of within-cluster variance.]
[Figures: worked examples on points A–H — single linkage minimizes the shortest distance from cluster to point, complete linkage minimizes the longest distance, and average linkage minimizes the average distance.]
Hierarchical Clustering: Comparison
[Figure: dendrograms for the same six points under single linkage (MIN), complete linkage (MAX), group average, and Ward's method, showing how the merge order differs by algorithm.]
Non-Hierarchical Clustering
 Nonhierarchical clustering is also called K-means clustering.
 The process essentially has two steps:
 Specify cluster seeds: identify starting points for each cluster, known as cluster seeds; they may be selected in a random process.
 Assignment: assign each observation to one of the cluster seeds based on similarity.
Selecting seed points
 How do we select the cluster seeds? Approaches fall into two basic categories:
 Researcher specified: the researcher provides the seed points based on external data. The two common sources of seed points are prior research or data from another multivariate analysis.
 Sample generated: the cluster seeds are generated from the observations of the sample, by either systematic or random selection.
Non-Hierarchical Clustering Algorithms
 Sequential threshold method – first determine a cluster center, then group all objects that are within a predetermined threshold from the center; one cluster is created at a time.
 Parallel threshold method – several cluster centers are determined simultaneously, then objects within a predetermined threshold from the centers are grouped.
 Optimizing partitioning method – first a nonhierarchical procedure is run, then objects are reassigned so as to optimize an overall criterion.
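K-means in scikit-learn mirrors the seed-point choice described above: seeds can be sample generated (random starts with several restarts) or researcher specified by passing explicit seed points, e.g. centroids from a prior hierarchical run (data below are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])

# Sample-generated seeds: random starting points, 10 restarts
km = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)

# Researcher-specified seeds, e.g. centroids from a hierarchical solution
seeds = np.array([[0.0, 0.0], [3.0, 3.0]])
km2 = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)

print(km.cluster_centers_)   # final centroids
print(km2.labels_[:5])       # cluster assignments
```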
Pros and Cons of Hierarchical Methods
 Hierarchical clustering is the more popular approach, with Ward's method and average linkage the most widely used algorithms.
 Advantages:
 Simplicity
 Measures of similarity
 Speed
Disadvantages
 Hierarchical methods can be misleading because undesirable early combinations may persist throughout the analysis and lead to artificial results.
 To reduce the impact of outliers, the researcher should analyze the data several times, each time deleting problem observations or outliers.
Combination of Both Methods
 A hierarchical technique is used to generate a complete set of cluster solutions, establish the cluster solutions, profile the cluster centers to act as cluster seed points, and identify outliers.
 After outliers are eliminated, the remaining observations can be clustered by a nonhierarchical method, with the cluster centers from the hierarchical results acting as the initial seed points.
Decision on the number of clusters to be formed
 A critical issue in performing either a hierarchical or a nonhierarchical cluster analysis is determining the number of clusters.
 The decision is especially critical for hierarchical techniques, because the researcher must select the cluster solution that will represent the data structure (known as the stopping rule).
Interpretation of the clusters
 The cluster centroid (i.e. the mean profile of the cluster) is particularly useful in interpretation.
 Interpretation involves calculating the distinguishing characteristics of each cluster and identifying the differences between clusters.
 If a cluster solution fails to show large variation, other cluster solutions should be calculated.
 The cluster centroids should be assessed based on theory or practical experience.
Validation
 Validation is essential in cluster analysis because the clusters are descriptive of structure and require additional support for their relevance.
 Cross-validation involves cluster analyzing separate samples, comparing the cluster solutions, and assessing the correspondence of the results.
 A common approach is to create two subsamples (randomly splitting the sample) and then compare the two cluster solutions for consistency with respect to the number of clusters and the cluster profiles.
Inferring Gene Functionality
 Researchers want to know the functions of new genes.
 Simply comparing a new gene's sequence to known DNA sequences often does not give away the actual function of the gene.
 For 40% of sequenced genes, functionality cannot be ascertained by comparing only to sequences of other known genes.
 Microarrays allow biologists to infer gene function even when there is not enough evidence to infer function based on similarity alone.
Microarray Analysis
 Microarrays measure the activity (expression level)
of the gene under varying conditions/time points
 Expression level is estimated by measuring the
amount of mRNA for that particular gene
 A gene is active if it is being transcribed
 More mRNA usually indicates more gene
activity
Microarray Experiments
 Analyze mRNA produced from cells in the tissue under the environmental conditions you are testing.
 Produce cDNA from the mRNA (DNA is more stable).
 Attach phosphor to the cDNA to see when a particular gene is expressed.
 Different color phosphors are available to compare many samples at once.
 Hybridize the cDNA over the microarray.
 Scan the microarray with a phosphor-illuminating laser.
 Illumination reveals transcribed genes.
 Scan the microarray multiple times for the different color phosphors.
Using Microarrays
• Track the sample over a period of time to see gene expression over time.
• Track two different samples under the same conditions to see the difference in gene expressions.
[Figure: a grid of expression profiles; each box represents one gene's expression over time.]
Using Microarrays (cont'd)
 Green: expressed only in the control cells.
 Red: expressed only in the experimental cells.
 Yellow: equally expressed in both samples.
 Black: NOT expressed in either control or experimental cells.
Microarray Data
 Microarray data are usually transformed into an intensity matrix (below).
 The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related.
 This is where clustering comes into play.

Intensity (expression level) of each gene at the measured times:

         Time X   Time Y   Time Z
Gene 1   10       8        10
Gene 2   10       0        9
Gene 3   4        8.6      3
Gene 4   7        8        3
Gene 5   1        2        3
Clustering of Microarray Data
 Plot each gene as a point in N-dimensional space.
 Make a distance matrix for the distance between every two gene points in the N-dimensional space.
 Genes with a small distance share the same expression characteristics and might be functionally related or similar.
 Clustering reveals groups of functionally related genes.
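Using the intensity matrix from the table above, the gene-to-gene distance matrix and the clustering can be produced directly; a sketch with scipy:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Intensity matrix: genes 1-5 at times X, Y, Z (from the table above)
genes = np.array([[10, 8, 10],
                  [10, 0, 9],
                  [4, 8.6, 3],
                  [7, 8, 3],
                  [1, 2, 3]])

d = pdist(genes, metric="euclidean")   # condensed gene-to-gene distances
print(squareform(d).round(1))          # 5 x 5 symmetric distance matrix

Z = linkage(d, method="complete")      # cluster genes on the distances
print(fcluster(Z, t=2, criterion="maxclust"))  # two candidate groups
```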
Hierarchical clustering
Step 1: Transform the genes × experiments matrix into a genes × genes distance matrix.
Step 2: Cluster the genes based on the distance matrix and draw a dendrogram until a single node remains.

(Genes A–C measured across experiments 1–4; the pairwise distances fill the triangular matrix.)

         Gene A   Gene B   Gene C
Gene A   0        ?        ?
Gene B            0        ?
Gene C                     0
Data and distance matrix

Expression data (genes A–E measured for two patients):

            A     B     C     D     E
Patient 1   90    190   90    200   150
Patient 2   190   390   110   400   200

Euclidean distance matrix between the genes:

     A       B       C       D       E
A    0.0
B    223.6   0.0
C    80.0    297.3   0.0
D    237.1   14.1    310.2   0.0
E    60.8    194.2   108.2   206.2   0.0
Hierarchical clustering (continued)

Initial distance matrix:

      G1   G2   G3   G4   G5
G1    0    2    6    10   9
G2         0    5    9    8
G3              0    4    5
G4                   0    3
G5                        0

Merge G1 and G2 (distance 2) into G(12); updated matrix (cluster distances taken as the largest distance between members, i.e. complete linkage):

       G(12)   G3   G4   G5
G(12)  0       6    10   9
G3             0    4    5
G4                  0    3
G5                       0

Merge G4 and G5 (distance 3) into G(45); updated matrix:

       G(12)   G3   G(45)
G(12)  0       6    10
G3             0    5
G(45)               0

Agglomeration schedule:

Stage   Groups
1       [1], [2], [3], [4], [5]
2       [1 2], [3], [4], [5]
3       [1 2], [3], [4 5]
4       [1 2], [3 4 5]
5       [1 2 3 4 5]
Clustering of Microarray Data (cont'd)
[Figures: the resulting clusters, followed by worked hierarchical clustering examples.]
Factor Analysis
 Factor analysis is an interdependence technique.
 It analyzes the correlations among a large number of variables.
 Its aim is to summarize the information with a minimum loss of information.
 In factor analysis, we group variables by their correlations, such that variables in a group (factor) have high correlations with each other.
• Research problem
▫ Is the analysis exploratory or confirmatory?
▫ Select objectives:
 Data summarization
 Data reduction
• Confirmatory
▫ Structural equation modeling
• Exploratory
▫ Select the type of factor analysis: what is being grouped – variables or cases?
• Cases
▫ Q-type factor analysis or cluster analysis
• Variables
▫ R-type factor analysis

• Research design
▫ What variables are included?
▫ How are the variables measured?
▫ What is the desired sample size?

• Assumptions
▫ Statistical considerations of normality, linearity, and
homoscedasticity.
▫ Homogeneity of sample
▫ Conceptual linkages
• Selecting a Factor Method
▫ Is the total variance or only common variance
analyzed?

• Total variance
▫ Extract factors with component analysis

• Common variance
▫ Extract factors with common factor analysis

• Specifying the Factor Matrix
▫ Determine the number of factors to be retained
• Selecting a Rotational Method
▫ Should the factors be correlated (oblique) or uncorrelated (orthogonal)?
• Orthogonal Methods
▫ VARIMAX
▫ EQUIMAX
▫ QUARTIMAX
• Oblique Methods
▫ Oblimin
▫ Promax
▫ Orthoblique
• Interpreting the Rotated Factor Matrix
▫ Can significant loadings be found?
▫ Can factors be named?
▫ Are communalities sufficient?
 If no, selecting a factor method
 If yes, go to factor model respecification
• Factor model respecification
▫ Were any variables deleted?
▫ Do you want to change the number of factors?
▫ Do you want another type of rotation?
If yes, selecting a Factor Method
If no, go to validation of the factor matrix

• Validation of the Factor Matrix
▫ Split/multiple samples
▫ Separate analysis for subgroups
▫ Identify influential cases
• Additional Uses
▫ Selection of Surrogate Variables
▫ Computation of Factor Scores
▫ Creation of Summated Scales
• Factor – summarizes the original set of observed variables.
• Factor loadings – the correlation between the original variables and the factors.
• Squared factor loadings – the percentage of the variance in an original variable that is explained by a factor.
Communality
• In factor analysis, a measure of the percentage of a variable's variation that is explained by the factors.
• A relatively high communality indicates that a variable has much in common with the other variables taken as a group.
Specifying the Unit of Analysis
• First select the unit of analysis for factor analysis:
▫ Variables, or
▫ Respondents
• Factor analysis is most often applied to a correlation matrix of the variables.
o R factor analysis – the common type of factor analysis, for variables.
• Factor analysis may also be applied to a correlation matrix of the individual respondents based on their characteristics.
o Q factor analysis – for respondents.
o Q factor analysis is not utilized frequently because it is difficult to calculate.
o Instead, some type of cluster analysis is usually used to group individual respondents.
Data summarization
• Explain the data in a smaller number of concepts
that equally represent the original set of variables.
Variable selection
• Whether factor analysis is used for data reduction or summarization, the researcher should consider the conceptual basis of the variables.
• In assessing the dimensions of store image, if no question on store workers were included, factor analysis would not be able to identify this dimension.
• Factor analysis is subject to the "garbage in, garbage out" phenomenon.
• If the researcher includes a large number of variables and hopes that factor analysis will "figure it out," then there is a high possibility of poor results.
Designing a factor analysis
• Factor analysis involves three basic decisions:
▫ Correlation among variables or respondents
▫ Variables selection and measurement issues
▫ Sample size
Correlations among variables or respondents
• There are two forms of factor analysis; both utilize a correlation matrix as the basic data input.
▫ R type – uses a traditional correlation matrix (correlations among variables) as input.
▫ Q type – uses a factor matrix that identifies similar individuals.
Q factor analysis is different from cluster analysis: Q-type factor analysis forms groupings based on the intercorrelations between the respondents, whereas cluster analysis forms groupings based on a distance-based similarity measure.
Variable selection and measurement issues
• Two specific questions must be answered:
▫ What type of variables can be used in factor analysis?
▫ How many variables should be included?
• Correlations are easy to compute for metric variables.
• Non metric variables are more problematic.
• If dummy variables (coded 0-1) are defined to represent categories of non metric variables, correlations can be computed.
• Boolean factor analysis is more appropriate if all the variables are dummy variables.
• If a study is being designed to reveal factor structure, strive to have at least five variables for each proposed factor.
Sample size
• For sample size:
▫ The sample must have more observations than variables.
▫ The minimum absolute sample size should be 50 observations.
• Maximize the number of observations per variable, with a minimum of 5 and preferably at least 10 observations per variable.
Need for factor analysis
• The difficulties of having too many independent variables in predicting the response variable are:
▫ Increased computational time to get a solution
▫ Increased time in data collection
▫ Too much expenditure in data collection
▫ Presence of redundant independent variables
▫ Difficulty in making inferences
• These can be avoided using factor analysis.
• Factor analysis aims at grouping the original input variables into factors which underlie the input variables.
• Initially, the total number of factors equals the total number of input variables; after performing factor analysis, the number of factors in the study can be reduced by dropping the insignificant factors based on certain criteria.
Objective of factor analysis
• The main objective of factor analysis is to summarize a large number of variables into a smaller number of factors which represent the basic dimensions underlying the data.
• Factor analysis is used to uncover the latent structure (dimensions) of a set of variables.
• It reduces attribute space from a larger number of variables to a smaller number of factors and as such is a "nondependent" procedure (that is, it does not assume a dependent variable is specified).
Assumptions
• Factor analysis is designed for interval data, although it
can also be used for ordinal data
• The variables used in factor analysis should be linearly
related to each other. This can be checked by looking at
scatter plots of pairs of variables.
• Obviously the variables must also be at least moderately
correlated to each other, otherwise the number of factors
will be almost the same as the number of original
variables, which means that carrying out a factor analysis
would be pointless.
Method of determining the appropriateness of factor analysis
• If no correlations are greater than 0.30, factor analysis is probably inappropriate.
• The correlations among variables can also be analyzed by computing the partial correlations among variables.
• If the partial correlations are high, factor analysis is inappropriate.
• Partial correlations should be small, because each variable should be explained by the variables loading on the factors.
• Bartlett test of sphericity:
▫ It is a statistical test for the presence of correlations among the variables.
▫ A statistically significant Bartlett's test of sphericity (sig < 0.05) indicates that sufficient correlations exist among the variables to proceed.

• Measure of sampling adequacy (MSA):
o The MSA value must exceed 0.50 for both the overall test and each individual variable.
o Variables with values less than 0.50 should be omitted from the factor analysis one at a time, with the smallest one being omitted each time.
• The MSA increases as:
o The sample size increases
o The average correlations increase
o The number of variables increases
o The number of factors decreases
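Bartlett's test of sphericity can be computed from the correlation matrix directly; a sketch using the standard chi-square approximation (the helper name is hypothetical, and X is a hypothetical n × p data matrix):

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Test H0: the correlation matrix is an identity matrix."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Standard chi-square approximation of the likelihood ratio
    stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2
    return stat, chi2.sf(stat, df)       # statistic and p-value

X = np.random.default_rng(0).normal(size=(100, 6))
stat, p_value = bartlett_sphericity(X)
print(stat, p_value)   # a small p-value (< 0.05) supports proceeding
```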
Selecting a Factor Extraction Method
• Before selecting the method of factor extraction, the researcher must have some understanding of the variance of a variable and how it is divided or partitioned.
• For the purposes of factor analysis, it is important to understand how much of a variable's variance is shared with other variables.
• The total variance of any variable can be divided into three types of variance:
▫ Common variance: variance in a variable that is shared with all other variables in the analysis.
▫ Specific variance (unique variance): variance associated with only a specific variable. This variance cannot be explained by the correlations to the other variables.
▫ Error variance: variance that also cannot be explained by correlations with other variables.
 As a variable becomes more highly correlated with one or more variables, the common variance (communality) increases.
 If unreliable measures or other sources of error variance are introduced, the common variance is reduced.
Factor analysis vs. principal component analysis

Factor analysis:
• Analyzes only the variance shared among the variables (common variance, without error or unique variance).
• Adjusts the diagonals of the correlation matrix with the unique factors.
• The observed variables in FA are linear combinations of the underlying and unique factors.
• FA's underlying constructs can be labeled and readily interpreted, given an accurate model specification.

Principal component analysis:
• Analyzes total variance.
• Inserts 1's on the diagonals of the correlation matrix.
• The component scores in PCA represent a linear combination of the observed variables weighted by eigenvectors.
• PCA components do not represent underlying constructs.

Both models yield similar results if the number of variables exceeds 30 or the communalities exceed 0.60.
Number of factors to extract
• Any decision on the number of factors to be retained should be based on several considerations:
▫ Factors with eigenvalues greater than 1.0
▫ A predetermined number of factors based on research objectives and prior research
▫ Enough factors to meet a specified percentage of variance explained, usually 60% or higher
▫ Factors shown by the scree test to have substantial amounts of common variance
[Figure: scree plot of eigenvalues (0.0–3.0) against component number (1–6); the number of factors to retain is indicated by the elbow of the curve.]
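The eigenvalues behind the scree plot come straight from the correlation matrix; a sketch on hypothetical data, applying the latent-root (eigenvalue > 1.0) and percentage-of-variance criteria listed above:

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(200, 6))
X[:, 1] += X[:, 0]                    # induce some shared variance
X[:, 4] += X[:, 3]

R = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]   # descending
print(eigenvalues.round(2))           # values to plot as the scree curve

# Latent-root criterion: retain factors with eigenvalue > 1.0
print("retain:", int(np.sum(eigenvalues > 1.0)))
# Percentage-of-variance criterion: cumulative share of total variance
print((np.cumsum(eigenvalues) / eigenvalues.sum()).round(2))
```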
Interpreting the factors
• The three processes of factor interpretation:
▫ Estimate the factor matrix
 First the unrotated factor matrix is computed, containing the factor loadings for each variable on each factor.
▫ Factor rotation
▫ Factor interpretation and respecification

Rotating factors
• When the factors are extracted, factor loadings are obtained. Factor loadings are the correlation of each variable with the factor. When the factors are rotated, the variance is redistributed so that the factor loading pattern and percentage of variance for each of the factors is different.
• The objective of rotating is to redistribute the variance from earlier factors to later ones in order to achieve a simpler, theoretically more meaningful factor pattern and make the result easier to interpret.
• Two types of factor rotation:
1. Orthogonal factor rotation
2. Oblique factor rotation
Orthogonal factor rotation
• In orthogonal rotation the axes are maintained at right angles. The objective of all methods of rotation is to simplify the rows and columns of the factor matrix.
• Simplifying the rows means making as many values in each row as close to zero as possible.
• Simplifying the columns means making as many values in each column as close to zero as possible.
There are three orthogonal rotation methods:
• 1) Quartimax
• 2) Varimax
• 3) Equimax
Orthogonal factor rotation
[Figure: variables V1–V5 plotted by their loadings on unrotated factors I and II; the rotated factors I and II remain at right angles but pass closer to the variable clusters.]

1. The quartimax rotation simplifies the rows of a factor matrix, i.e. it focuses on rotating the initial factors so that a variable loads high on one factor and as low as possible on all other factors.
2. The varimax rotation simplifies the columns of the factor matrix. With this approach, the maximum possible simplification is reached if there are only 1's and 0's in a column.
3. The equimax rotation is a compromise between quartimax and varimax.
In practice, the first two are the most commonly applied.
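In scikit-learn (version 0.24 or later, an assumption to verify for your install), varimax and quartimax rotation are available directly on FactorAnalysis; a sketch on hypothetical data with two planted factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Six observed variables driven by two latent factors plus noise
F = rng.normal(size=(300, 2))
W = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
X = F @ W.T + 0.3 * rng.normal(size=(300, 6))

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(fa.components_.T.round(2))   # rotated loadings: each variable loads
                                   # high on one factor, near zero elsewhere
```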
Oblique rotation methods
• Oblique rotations are similar to orthogonal rotations.
• They allow correlated factors instead of maintaining independence between the rotated factors.
• In oblique rotation the axes need not be maintained at right angles.
• Oblique rotation represents the clustering of variables more accurately.
• There are three rotation methods:
▫ Oblimin
▫ Promax
▫ Orthoblique
Oblique factor rotation
[Figure: variables V1–V5 plotted against unrotated factors I and II, with both the orthogonal rotated factors and the oblique rotated factors shown; the oblique axes are not constrained to right angles and pass through the variable clusters more closely.]
Assessing factor analysis
• In interpreting factors, a decision must be made regarding which factor loadings are worth consideration and attention.
• Loadings exceeding 0.70 are considered indicative of well-defined structure and are the goal of any factor analysis.
Interpreting a factor matrix
• Step 1: Examine the factor matrix of loadings.
▫ The factor loading matrix contains the factor loading of each variable on each factor.
▫ Rotated loadings are usually used in factor interpretation unless data reduction is the sole objective.
▫ If an oblique rotation has been used, two matrices of factor loadings are provided:
 Factor pattern matrix: represents the unique contribution of each variable to the factor.
 Factor structure matrix: simple correlations between variables and factors; these loadings contain both the unique variance between variables and factors and the correlation among factors.
Validation of factor analysis
Validation involves assessing the degree of generalizability of the results to the population and the potential influence of individual cases on the overall results.
• Use of a confirmatory perspective:
▫ The most direct method of validating the results is to use a confirmatory perspective.
▫ Assess the replicability of the results, either with a split sample in the original data set or with a separate sample.
Assessing factor structure stability
• Factor stability is primarily dependent on the sample size and on the number of cases per variable.
• Comparison of the two resulting factor matrices will provide an assessment of the robustness of the solution.
Detecting influential observations
• Another issue in the validation of factor analysis is detecting influential observations.
• Estimate the model with and without observations identified as outliers to assess their impact on the results.
Additional uses of factor analysis results
• The objective is:
▫ To identify logical combinations of variables and better understand the interrelationships among variables, in which case factor interpretation will suffice.
▫ To identify appropriate variables for subsequent application to other statistical techniques, in which case some form of data reduction will be used.
▫ There are three options for data reduction:
 Summated scales
 Surrogate variables
 Factor scores
Summated Scales
• One of the common uses of factor analysis is the formation of summated scales, where we add the scores on all the variables loading on a component to create the score for the component.
• To verify that the variables for a component are measuring similar entities that are legitimate to add together, we compute Cronbach's alpha.
• If Cronbach's alpha is 0.70 or greater (0.60 or greater for exploratory research), we have support for the internal consistency of the items, justifying their use in a summated scale.
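Cronbach's alpha is simple to compute by hand from the standard formula (a sketch; items is a hypothetical respondents × items matrix of scores loading on one component):

```python
import numpy as np

def cronbach_alpha(items):
    """items: respondents x items matrix of scores on one scale."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of scale total
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
trait = rng.normal(size=200)                     # latent trait
items = trait[:, None] + 0.5 * rng.normal(size=(200, 4))  # 4 related items
print(round(cronbach_alpha(items), 2))           # well above the 0.70 cutoff
```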
Surrogate variable
• One option is to examine the factor matrix and select the variable with the highest factor loading on each factor to act as a surrogate variable.
• The selection process is more difficult when two or more variables have loadings that are significant and close to each other.
• Disadvantages of selecting a single surrogate variable:
▫ It does not address the issue of measurement error.
▫ It also runs the risk of potentially misleading results by selecting only a single variable to represent a more complex result.
For these reasons, researchers often calculate a summated scale or factor scores instead of using a surrogate variable.
Factor Score
• A number that represents each observation's calculated value on each factor in a factor analysis.
• At the initial stage, the respondents assign scores to the variables. After performing factor analysis, each factor assigns a score to each respondent. Such scores are called respondent factor scores.
• Factor scores are standardized to have a mean of 0 and a standard deviation of 1.

Multivariate

  • 14. Validity  Validity – the extent to which a measure correctly represents the concept under study, i.e. the degree to which it is free from any systematic or nonrandom error.  Validity relates not to what should be measured, but to how it is measured.
  • 15. Reliability Reliability is the degree to which the observed variable measures the “true” value and is “error free”.  More reliable measures will show greater consistency than less reliable measures.  Choose the variable with the higher reliability.  Validity is concerned with how well the concept is defined by the measures, whereas reliability relates to the consistency of the measures. 
  • 16. Multivariate measurement  Use of two or more variables as indicators(i.e. single variable used in conjunction with one or more other variables to form a composite measure) of a single composite measure.  For e.g. A personality test may provide the answers to a series of individual questions, which are then combined to form a single score representing the personality trait.
  • 17. Statistical Significance  All multivariate techniques are based on statistical inference of the values of a population from a randomly drawn sample of that population.  When interpreting statistical inferences, the researcher must specify the acceptable level of statistical error.  A 5% level of significance means we are 95% certain that our sample results are not due to chance (a 1% level corresponds to 99% certainty).
  • 18. Types of statistical error There are two types of errors: Type 1 error and Type 2 error.

                    H0 is true            H0 is false
      Accept H0     1 - α (correct)       β (Type 2 error)
      Reject H0     α (Type 1 error)      1 - β (Power)
  • 19.  Type 1 error is the probability of rejecting the null hypothesis when it is true. It is also known as producer’s risk. Probability of type 1 error is alpha(α).  Type 2 error is the probability of failing to reject the null hypothesis when it is false. It is also known as consumer’s risk. Probability of type 2 error is beta(β).
  • 20. Statistical power  Power is the probability of correctly rejecting the null hypothesis when it is false.  Power of statistical inference test is 1-β  Increased sample sizes produce greater power of the statistical test.  Researchers should always design the study to achieve a power level of 0.80 at the desired significance level.
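To make the power–sample size relationship concrete, here is a minimal sketch in Python, assuming the statsmodels package and an illustrative "medium" effect size of 0.5 for a two-group comparison:

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed for power = 0.80 at the 5% significance level
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(round(n_per_group))        # roughly 64 per group

# Conversely, the power achieved with only 30 observations per group
power = analysis.solve_power(effect_size=0.5, nobs1=30, alpha=0.05)
print(round(power, 2))           # well below 0.80: larger samples give greater power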
  • 21. What type of relationship is being examined – dependence or interdependence? For a dependence relationship with one dependent variable in a single relationship, the choice turns on the measurement scale of the dependent variable: if metric, use (1) multiple regression or (2) conjoint analysis; if nonmetric, use (1) multiple discriminant analysis or (2) linear probability models. For interdependence among cases/respondents, use cluster analysis.
  • 22. For a dependence relationship with multiple relationships of dependent and independent variables, use structural equation modeling. For several dependent variables in a single relationship, check the measurement scale of the dependent variable: if it is metric and the predictor variables are metric, use canonical correlation analysis; if it is metric and the predictors are nonmetric, use multivariate analysis of variance; if it is nonmetric, use canonical correlation analysis with dummy variables.
  • 23. For interdependence, ask whether the structure of relationships is among variables or among objects: among variables, use factor analysis (or confirmatory factor analysis); among objects, check how the attributes are measured – if metric, use multidimensional scaling; if nonmetric, use correspondence analysis.
  • 24. Discriminant analysis • Discriminant analysis is used when the dependent variable is a non metric variable and the independent variables are metric variables. • Total sample can be divided into groups based on a qualitative dependent variable.
  • 25. Discriminant Analysis • Discriminant analysis also refers to a wider family of techniques ▫ Still for discrete response, continuous predictors ▫ Produces discriminant functions that classify observations into groups  These can be linear or quadratic functions  Can also be based on non-parametric techniques
  • 26. Objectives • To understand group differences and to correctly classify objects into groups or classes. • It is used, for example, to distinguish innovators from non-innovators according to their demographic and psychographic profiles, males from females, or good credit risks from poor credit risks.
  • 27. Application of discriminant analysis – Stage 1: Research problem; select objectives: evaluate group differences on a multivariate profile, classify observations into groups, identify dimensions of discrimination between groups. Stage 2: Research design issues; selection of independent variables, sample size considerations, creation of analysis and holdout samples. Stage 3: Assumptions; normality of independent variables, linearity of relationships, lack of multicollinearity among independent variables, equal dispersion matrices.
  • 28. Stage 4: Estimation of the discriminant functions – simultaneous or stepwise estimation; significance of discriminant functions. Assess predictive accuracy with classification matrices – determine the optimal cutting score, specify the criterion for assessing the hit ratio, and test the statistical significance of predictive accuracy.
  • 29. Stage 6 (from Stage 5): Validation of discriminant results – split-sample or cross-validation; profiling group differences.
  • 30. Stage 5: Interpretation of the discriminant functions – how many functions will be interpreted? One: evaluation of the single function (discriminant weights, discriminant loadings, partial F values). Two or more: evaluation of the separate functions (discriminant weights, discriminant loadings, partial F values) and evaluation of the combined functions (rotation of functions, potency index, graphical display of group centroids, graphical display of loadings).
  • 31. Stage 2: Selecting Dependent And Independent Variables • The researcher first must specify which variables are to be independent (must be metric) and which variable is to be the dependent variable (must be nonmetric). • The groups of the dependent variable must be mutually exclusive and exhaustive. • Dependent variable categories should be different and unique on the independent variables; otherwise the analysis will not be able to uniquely profile each group, resulting in poor explanation and classification.
  • 32. Converting metric variables. • In some situations the dependent variable is not truly categorical; we may use an ordinal or interval measurement as a categorical dependent variable by creating artificial groups. • Consider using extreme groups to maximize the group differences. • This method is called the polar extremes approach.
  • 33. The Independent Variables • Independent variables are selected in two ways ▫ Identifying variables either from previous research or from the theoretical model. ▫ Utilizing the researcher’s knowledge. • In both instances, independent variables must identify differences between at least two groups of the dependent variable.
  • 34. Sample Size • Overall sample size ▫ Have 20 cases per independent variable, with a minimum recommended level of 5 observations per variable. • Sample size per category ▫ The smallest group must exceed the number of independent variables. ▫ Wide variations in group sizes will impact the estimation of the discriminant function and the classification of observations. • The researcher should maintain a sufficient sample size both overall and for each group.
  • 35. Division of the sample • In discriminant analysis the sample is divided into two subsamples ▫ One (the analysis sample) for estimation of the discriminant function ▫ Another (the holdout sample) for validation purposes.  The sample may be divided 50–50, 60–40, or 75–25 depending on the overall sample size.
  • 36.  If the categorical groups are equally represented in the total sample, then the analysis and holdout samples are of approximately equal size (i.e. if the sample consists of 50 males and 50 females, the analysis and holdout samples would each have 25 males and 25 females).  If the original groups are unequal, the sizes of the analysis and holdout samples should be proportionate to the total sample distribution (i.e. if the sample contained 70 females and 30 males, the analysis and holdout samples would each consist of 35 females and 15 males).
  • 37. Stage 3: Assumptions • The most important assumption is the equality of the covariance matrices, which affects both estimation and classification. IMPACT ON ESTIMATION: • If the sample sizes are small and the covariance matrices are unequal, the estimation process is affected. • Data that do not follow the normality assumption can also cause problems in the estimation.
  • 38. Impact On Classification • Unequal covariance matrices affect the classification process. • The effect can be minimized by increasing the sample size and also by using the group-specific covariance matrices for classification purposes. • Multicollinearity among the independent variables can reduce the estimated impact of independent variables in the derived discriminant functions, particularly if a stepwise estimation process is used.
  • 39. Stage 4: Estimation of the Model • The first step in deriving the discriminant function is to choose the estimation method ▫ 1. The simultaneous (direct) method ▫ 2. The stepwise method • 1. Simultaneous method: it involves computing the discriminant function so that all of the independent variables are considered at the same time.
  • 40. 2. Stepwise Method: it involves entering the independent variables into the discriminant function one at a time on the basis of their discriminating power. Step 1 – Choose the single best discriminating variable. Step 2 – Pair each remaining independent variable with the initial variable, one at a time, and select the variable that best improves the discriminating power of the function in combination with the first variable; select additional variables in the same manner. Step 3 – Variables that are not useful in discriminating between the groups are eliminated. Stepwise estimation becomes less stable when the sample size declines below the recommended level of 20 observations per independent variable.
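For the simultaneous (direct) method, a minimal sketch in Python follows, assuming scikit-learn (which enters all independent variables at once; it offers no stepwise option, and the two-group data here are simulated purely for illustration):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)),    # group 0 on three metric variables
               rng.normal(1, 1, (50, 3))])   # group 1, with shifted means
y = np.array([0] * 50 + [1] * 50)            # nonmetric (two-group) dependent variable

lda = LinearDiscriminantAnalysis().fit(X, y) # simultaneous estimation
z = lda.transform(X)                         # discriminant Z score for each case
print(lda.coef_)                             # unstandardized discriminant weights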
  • 41. Statistical significance • The researcher must assess the level of significance of the discriminant function. Significance tests can be based on a number of statistical criteria; conventionally a level of .05 or beyond is used, and if a higher level of risk is acceptable a significance level of 0.2 or 0.3 may be fixed. Overall significance: 1. Simultaneous estimation: the measures of Wilks' lambda, Hotelling's trace, and Pillai's criterion all evaluate the statistical significance of the discriminant function. 2. Stepwise estimation: the Mahalanobis D2 and Rao's V measures are most appropriate. Mahalanobis D2 is based on squared Euclidean distance and adjusts for unequal variances.
  • 42. Assessing overall model fit • This assessment involves three tasks ▫ Calculating discriminant Z scores for each observation ▫ Evaluating group differences on the discriminant Z scores ▫ Assessing group membership prediction accuracy
  • 43. Calculating Discriminant Z scores • The discriminant function can be expressed with either standardized or unstandardized weights and values. • The standardized version is more useful for interpretation purpose. • The unstandardized version is easier to use in calculating the discriminant Z score.
  • 44. Evaluating Group Differences • Group differences are assessed by comparing the group centroids, the average discriminant Z score for all members of a group. • The differences between centroids are measured in terms of the Mahalanobis D2 measure.
  • 45. Assessing Group Membership Prediction Accuracy • Because the dependent variable is nonmetric, it is not possible to use a measure such as R2 to assess predictive accuracy. • Rather, each observation must be assessed as correctly or incorrectly classified. In doing so, several major considerations must be addressed: ▫ developing classification matrices ▫ calculating the cutting score ▫ constructing the classification matrices ▫ standards for assessing classification accuracy.
  • 46. Classification matrix • This is also called the prediction matrix. • The correctly classified cases appear on the diagonal because the predicted and actual groups are the same. • Off-diagonal entries represent cases that have been incorrectly classified. • The sum of the diagonal elements divided by the number of cases represents the hit ratio.
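A minimal sketch of the classification matrix and hit ratio in Python, assuming scikit-learn and hypothetical actual/predicted group labels:

import numpy as np
from sklearn.metrics import confusion_matrix

y_actual    = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_predicted = np.array([0, 0, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_actual, y_predicted)   # rows: actual, columns: predicted
hit_ratio = np.trace(cm) / cm.sum()            # diagonal (correct) / all cases
print(cm)
print(hit_ratio)                               # 6 of 8 correct -> 0.75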
  • 47. Cutting score • Criterion against which each individual's discriminant Z score is compared to determine predicted group membership. • It represents the dividing point used to classify observation into one of two groups based on their discriminant function score. • Optimal cutting score: Discriminant Z score value that best separates the groups on each discriminant function for classification purposes.
  • 48. Optimal cutting score with equal sample sizes: the cutting score lies midway between the two group centroids ZA and ZB; observations scoring below it are classified as A (nonpurchaser), above it as B (purchaser). [diagram]
  • 49. Optimal cutting score with unequal sample sizes: the optimal weighted cutting score adjusts for the differing group sizes, unlike the unweighted cutting score placed midway between the centroids ZA and ZB. [diagram]
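A minimal sketch of the cutting-score calculation, using the commonly cited formulas (the midpoint of the group centroids for equal group sizes, and a score weighting each centroid by the size of the opposite group otherwise); the centroid values and group sizes below are hypothetical:

def cutting_score(z_a, z_b, n_a=None, n_b=None):
    """Cutting score between group centroids z_a and z_b."""
    if n_a is None or n_b is None or n_a == n_b:
        return (z_a + z_b) / 2                     # unweighted midpoint
    return (n_a * z_b + n_b * z_a) / (n_a + n_b)   # weighted cutting score

print(cutting_score(-1.2, 0.8))                    # equal sizes: -0.2
print(cutting_score(-1.2, 0.8, n_a=70, n_b=30))    # unequal sizes: 0.2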
  • 50. Stage 5: Interpretation of the results Three methods of determining the relative importance of each independent variable. • Standardized discriminant weights(coefficients) • Discriminant loadings ( structure correlations) • Partial F values
  • 51. Discriminant weights (coefficients) • One way of interpreting discriminant functions is to examine the sign and magnitude of the coefficient assigned to each variable in computing the discriminant function. • Independent variables with larger coefficients contribute more to the discriminating power of the function than variables with smaller coefficients. • The interpretation of discriminant coefficients is similar to the interpretation of beta coefficients in regression analysis.
  • 52. Discriminant loadings • Also referred to as structure correlations. • Loadings are increasingly used as a basis for interpretation because of the deficiencies in utilizing coefficients. • Unique characteristic: loadings can be calculated for all variables, whether they were used in the estimation of the discriminant function or not, which is particularly useful in stepwise estimation. • Loadings are more valid than coefficients for interpreting the discriminating power of independent variables because of their correlational nature. • Loadings exceeding ±.40 are considered substantive for interpretation purposes.
  • 53. Partial F values • When the stepwise method is selected, partial F values can be used to interpret the discriminating power of the independent variables. • Large F values indicate greater discriminatory power.
  • 54. Validation • Discriminant loadings are the preferred method to assess the contribution of each variable to a discriminant function because they are: ▫ a standardized measure of importance (magnitudes ranging from 0 to 1) ▫ available for all independent variables whether used in the estimation process or not ▫ unaffected by multicollinearity. The discriminant function must be validated either with a holdout sample or with one of the “leave one out” procedures.
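A minimal sketch of "leave one out" validation in Python, assuming scikit-learn: each case is held out in turn, the function is estimated on the remaining cases, and the held-out case is then classified (data simulated for illustration):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(1, 1, (30, 3))])
y = np.array([0] * 30 + [1] * 30)

scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print(scores.mean())     # proportion of held-out cases classified correctly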
  • 56. Cluster Analysis  Statistical classification technique in which cases, data, or objects (events, people, things, etc.) are subdivided into groups (clusters) such that the items in a cluster are very similar (but not identical) to one another and very different from the items in other clusters.
  • 57. Application of cluster analysis – Stage 1: Research problem; select objectives: taxonomy description, data simplification, reveal relationships; select the clustering variables. Stage 2: Research design issues – Can outliers be detected? Should the data be standardized?
  • 58. Stage 2 (continued): Select a similarity measure – are the cluster variables metric or nonmetric? Nonmetric data: association measures of similarity (matching coefficients). Metric data: is the focus on pattern or proximity? Proximity: distance measures of similarity (Euclidean distance, city-block distance, Mahalanobis distance). Pattern: correlation measures of similarity (correlation coefficient). Standardization options: standardizing variables, standardizing by observation.
  • 59. Stage 3: Assumptions – Is the sample representative of the population? Is multicollinearity substantial enough to affect the results? Stage 4: Selecting a clustering algorithm – is a hierarchical, nonhierarchical, or combination of the two methods used? Hierarchical linkage methods available: single linkage, complete linkage, average linkage, Ward's method, centroid method. Nonhierarchical assignment methods available: sequential threshold, parallel threshold, optimization, selecting seed points. Combination: use a hierarchical method to specify cluster seed points for a nonhierarchical method. How many clusters are formed? Examine increases in the agglomeration coefficient; examine the dendrogram and vertical icicle plots; apply conceptual considerations. Respecification: were any observations deleted as outliers or as members of small clusters? If yes, rerun the cluster analysis.
  • 60. Stage 5: Interpreting the clusters – examine the cluster centroids; name the clusters based on the clustering variables. Stage 6: Validating and profiling the clusters – validation with selected outcome variables; profiling with additional descriptive variables.
  • 61. Governing principle: maximization of homogeneity within clusters and, simultaneously, maximization of heterogeneity across clusters.
  • 62. Three Basic Questions: 1. How to measure similarity? 2. How to form clusters? (extraction method) 3. How many clusters?
  • 63. Answers to First Two Basic Questions: 1. How to measure similarity? • Distance – squared Euclidean. 2. How to form clusters? • Hierarchical – Ward's method.
  • 64. Third Basic Question: How many clusters? 1. Run the cluster analysis and examine solutions for two, three, four, etc. clusters. 2. Select the number of clusters based on “a priori” criteria, practical judgment, common sense, theoretical foundations, and statistical significance.
  • 65. Steps in Cluster Analysis: 1. Identify the variables to be clustered. 2. Determine if clusters exist. To do so, verify the clusters are statistically different and theoretically meaningful. 3. Make an initial decision on how many clusters to use. 4. Where possible, validate clusters using an external variable. 5. Describe the characteristics of the derived clusters using demographics, psychographics, etc.
  • 66. Objectives of cluster analysis  The goal of cluster analysis is to partition a set of objects into two or more groups based on similarity.  Cluster analysis is used for  Taxonomy description: identifying natural groups within the data.  Data simplification: the ability to analyze groups of similar observations instead of all individual observations.  Relationship identification: the simplified structure from cluster analysis portrays relationships not revealed otherwise.
  • 67. Research Design in Cluster Analysis • Outliers. • Similarity/distance measures. • Standardizing the data.
  • 68. Outliers  In a set of numbers, a number that is much larger or much smaller than the rest of the numbers is called an outlier.  Outliers are values that “lie outside” the other values.
  • 69. Similarity measure  Three different forms of similarity measures are:  Correlation Measures (require metric data)  Having widespread application, represent patterns rather than proximity.  Distance Measures (require metric data)  Best represents the concept of proximity, which is fundamental to cluster analysis.  Association Measures (require nonmetric data)  Represent the proximity of objects across a set of nonmetric variables.
  • 70. Types of distance measures  Euclidean distance  Squared (or absolute) Euclidean distance  City – block (Manhattan) distance  Chebychev distance  Mahalanobis distance (D2 )
  • 71. Euclidean distance: the straight-line distance between two points A (x1, y1) and B (x2, y2), d = √((x2 − x1)² + (y2 − y1)²). [diagram]
  • 72.  Squared (or absolute) Euclidean distance  It is the sum of the squared differences without taking the square root.  It is the distance measure for the centroid and Ward's methods of clustering.  City-block distance  Uses the sum of the absolute differences of the variables (i.e. the two sides of a right triangle rather than the hypotenuse).  Simplest to calculate, but may lead to invalid clusters if the clustering variables are highly correlated.
  • 73.  Chebychev distance  Another distance measure; it is particularly susceptible to differences in scale across the variables.  Mahalanobis (or correlation) distance (D2)  This measure adjusts for the correlations among the variables and uses the adjusted distance to cluster the observations.
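A minimal sketch of these distance measures in Python, assuming SciPy (the two observations u and v, and the sample X used to estimate the inverse covariance matrix for the Mahalanobis distance, are hypothetical):

import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

print(distance.euclidean(u, v))      # straight-line distance
print(distance.sqeuclidean(u, v))    # squared Euclidean distance
print(distance.cityblock(u, v))      # city-block (Manhattan) distance
print(distance.chebyshev(u, v))      # largest single coordinate difference

X = np.random.default_rng(3).normal(size=(50, 3))    # sample for covariance
VI = np.linalg.inv(np.cov(X, rowvar=False))          # inverse covariance matrix
print(distance.mahalanobis(u, v, VI))                # correlation-adjusted distance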
  • 74. Standardizing the data  Cluster analyses using distance measures are quite sensitive to differing scales or magnitudes among the variables.  Distance measures computed on unstandardized data change whenever the scale of a variable is changed.
  • 75. Standardizing the variables  Common form of standardization is the conversion of each variable to standard scores (i.e. Z score)  By subtracting the mean and dividing by the standard deviation for each variable.  It eliminates the effects due to scale differences not only across variables, but for the same variable.  Clustering variables should be standardized whenever possible to avoid problems resulting from the use of different scale values among clustering variables.  A measure of Euclidean distance that directly incorporates a standardization procedure is the Mahalanobis distance (D2 ).
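A minimal sketch of Z-score standardization in Python (the small data matrix, with columns on very different scales, is hypothetical):

import numpy as np

X = np.array([[170.0, 65000.0],
              [160.0, 80000.0],
              [180.0, 70000.0]])     # e.g. height in cm, income in dollars

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # subtract mean, divide by std, per column
print(Z.round(2))                          # each column now has mean 0, sd 1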
  • 76. Standardizing by observation  Standardizing by respondent standardizes each question not to the sample's average but instead to that respondent's average score.  Within-case (or row-centering) standardization can be quite effective in removing response-style effects.  Standardization provides a remedy to a fundamental issue in similarity and distance measures.
  • 77. Cluster Analysis Assumptions: Representative Sample. • The cluster analysis is only as good as the representativeness of the sample. Therefore, all efforts should be made to ensure that the sample is representative and the results are generalizable to the population.
  • 78. Minimal Multicollinearity. • Input variables should be examined for strong multicollinearity and, if present: • reduce the variables to equal numbers in each set of correlated measures, or • use a distance measure that compensates for the correlation, such as Mahalanobis distance.
  • 79. Deriving Clusters and Assessing Overall Fit  With the clustering variables selected and the similarity matrix calculated, the partitioning process begins.  The researcher must:  Select the partitioning procedure used for forming clusters.  Make the decision on the number of clusters to be formed.
  • 80. Partitioning Procedures  To maximize the differences between clusters relative to the variation within the cluster.  The most widely used procedures can be classified as  Hierarchical  Nonhierarchical
  • 82. Hierarchical Clustering  Two main types of hierarchical clustering  Agglomerative:  Start with the points as individual clusters  At each step, merge the closest pair of clusters until only one cluster (or k clusters) left  Divisive:  Start with one, all-inclusive cluster  At each step, split a cluster until each cluster contains a point (or there are k clusters)  Traditional hierarchical algorithms use a similarity or distance matrix  Merge or split one cluster at a time
  • 83. Agglomerative example – Step 0: each of six observations (OBS 1–OBS 6) is treated as a separate cluster; the horizontal axis of the diagram is the distance measure (0.2 to 1.0), read left to right for agglomerative clustering and right to left for divisive.
  • 84. Step 1: the two observations with the smallest pairwise distance are clustered.
  • 85. Step 2: the two other observations with the smallest distance among the remaining points/clusters are clustered.
  • 86. Step 3: observation 3 joins cluster 1.
  • 87. Step 4: clusters 1 and 2 from Step 3 join into a “supercluster”; a single observation remains unclustered (an outlier).
  • 88. Five Most Common Agglomerative Algorithms  Single linkage: smallest distance from any object in one cluster to any object in the other.  Complete linkage: largest distance between an observation in one cluster and an observation in the other.  Average linkage: average distance between an observation in one cluster and an observation in the other.  Centroid method: distance between the centroids of the two clusters.  Ward's method: the distance between two clusters is the difference between the total within-cluster sum of squares for the two clusters separately and the within-cluster sum of squares resulting from merging the two clusters.
  • 89. Agglomerative algorithms illustrated – single linkage: minimum distance; complete linkage: maximum distance; average linkage: average distance; centroid method: distance between centres; Ward's method: minimization of within-cluster variance. [diagrams]
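A minimal sketch of agglomerative clustering in Python, assuming SciPy; the method argument maps directly onto the algorithms listed above ('single', 'complete', 'average', 'centroid', 'ward'), and the data are simulated:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(4).normal(size=(10, 2))   # ten hypothetical objects

Z = linkage(X, method='ward')                       # agglomeration schedule
labels = fcluster(Z, t=3, criterion='maxclust')     # cut the tree at 3 clusters
print(labels)

# Each row of Z records one merge: the two clusters joined, the distance
# at which they joined, and the size of the resulting cluster.
print(Z.round(2))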
  • 90.–92. Worked illustrations on the same points: single linkage merges using the shortest cluster-to-point distance (7.0 rather than 8.5); complete linkage minimizes the longest cluster-to-point distance (9.5 rather than 10.5); average linkage minimizes the average cluster-to-point distance (8.5 rather than 9.0). [diagrams]
  • 94. Non Hierarchical Clustering  Non hierarchical clustering is also called K- means clustering.  The process essentially has two steps:  Specify cluster seeds:  To identify starting points for each cluster known as cluster seeds. It is selected in a random process.  Assignment  To assign each observation to one of the cluster seeds based on similarity.
  • 95. Selecting seed points  How do we select the cluster seeds? Approaches fall into two basic categories.  Researcher specified: the researcher provides the seed points based on external data; the two common sources are prior research or data from another multivariate analysis.  Sample generated: the cluster seeds are generated from the observations of the sample, by either systematic or random selection.
  • 96. Non Hierarchical Clustering Algorithms  Sequential Threshold method - first determine a cluster center, then group all objects that are within a predetermined threshold from the center - one cluster is created at a time  Parallel Threshold method - simultaneously several cluster centers are determined, then objects that are within a predetermined threshold from the centers are grouped  Optimizing Partitioning method - first a nonhierarchical procedure is run, then objects are reassigned so as to optimize an overall criterion.
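A minimal sketch of nonhierarchical clustering in Python, assuming scikit-learn's k-means; it contrasts sample-generated seeds with researcher-specified seed points (the data and the seed coordinates are hypothetical):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(5).normal(size=(100, 2))

# Sample-generated seeds (k-means++ initialization, several restarts)
km1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Researcher-specified seeds, e.g. centroids taken from prior research
seeds = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
km2 = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)

print(km1.cluster_centers_.round(2))
print(km2.cluster_centers_.round(2))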
  • 97. Pros and Cons of Hierarchical Methods  Hierarchical clustering is the more popular clustering approach, with Ward's method and average linkage the most widely used algorithms.  Advantages:  Simplicity  Measures of similarity  Speed
  • 98. Disadvantages  Hierarchical methods can be misleading because undesirable early combinations may persist throughout the analysis and lead to artificial results.  To reduce the impact of outliers, the researcher should analyze the data several times, each time deleting problem observations or outliers.
  • 99. Combination of Both Methods  A hierarchical technique is used to generate a complete set of cluster solutions, establish the appropriate number of clusters, profile the cluster centers to act as cluster seed points, and identify outliers.  After outliers are eliminated, the remaining observations can be clustered by a nonhierarchical method, with the cluster centers from the hierarchical results acting as the initial seed points.
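A minimal sketch of this combination approach in Python, assuming SciPy and scikit-learn (data simulated): a hierarchical solution supplies the cluster centers, which then seed a nonhierarchical (k-means) pass:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.random.default_rng(6).normal(size=(100, 2))

Z = linkage(X, method='ward')                       # hierarchical stage
labels = fcluster(Z, t=3, criterion='maxclust')

# Hierarchical cluster centers become the k-means seed points
seeds = np.vstack([X[labels == k].mean(axis=0) for k in (1, 2, 3)])
km = KMeans(n_clusters=3, init=seeds, n_init=1).fit(X)
print(km.labels_[:10])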
  • 100. Decision on the number of clusters to be formed  A key issue in performing either a hierarchical or nonhierarchical cluster analysis is determining the number of clusters.  The decision is critical for hierarchical techniques because the researcher must select the cluster solution that will represent the data structure (the criterion applied is called a stopping rule).
  • 101. Interpretation of the clusters  The cluster centroid (i.e. the mean of the cluster) is particularly useful in interpretation.  Interpretation involves calculating the distinguishing characteristics of each cluster and identifying the differences between clusters.  If a cluster solution fails to show large variation, other cluster solutions should be calculated.  The cluster centroids should be assessed based on theory or practical experience.
  • 102. Validation  Validation is essential in cluster analysis because the clusters are descriptive of structure and require additional support for their relevance.  Cross-validation involves cluster analyzing separate samples, comparing the cluster solutions, and assessing the correspondence of the results. A common approach is to create two subsamples (by randomly splitting the sample) and then compare the two cluster solutions for consistency with respect to the number of clusters and the cluster profiles.
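A minimal sketch of split-sample cross-validation in Python, assuming scikit-learn; the adjusted Rand index used here to quantify agreement between the two solutions is our addition, not part of the slides:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.random.default_rng(7).normal(size=(200, 2))
idx = np.random.default_rng(8).permutation(200)
a, b = idx[:100], idx[100:]                        # random split into subsamples

km_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[a])
km_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[b])

# Classify the full sample under each solution and compare the labelings;
# values near 1 indicate consistent cluster structure.
print(adjusted_rand_score(km_a.predict(X), km_b.predict(X)))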
  • 103. Inferring Gene Functionality  Researchers want to know the functions of new genes  Simply comparing the new gene sequences to known DNA sequences often does not give away the actual function of gene  For 40% of sequenced genes, functionality cannot be ascertained by only comparing to sequences of other known genes  Microarrays allow biologists to infer gene function even when there is not enough evidence to infer function based on similarity alone
  • 104. Microarray Analysis  Microarrays measure the activity (expression level) of the gene under varying conditions/time points  Expression level is estimated by measuring the amount of mRNA for that particular gene  A gene is active if it is being transcribed  More mRNA usually indicates more gene activity
  • 105. Microarray Experiments  Analyze mRNA produced from cells in the tissue under the environmental conditions you are testing  Produce cDNA from the mRNA (DNA is more stable)  Attach phosphor labels to the cDNA to see when a particular gene is expressed  Different color phosphors are available to compare many samples at once  Hybridize the cDNA over the microarray  Scan the microarray with a phosphor-illuminating laser  Illumination reveals transcribed genes  Scan the microarray multiple times for the different color phosphors
  • 106. Using Microarrays • Track one sample over a period of time to see gene expression over time. • Track two different samples under the same conditions to see the difference in gene expression. (In the accompanying figure, each box represents one gene's expression over time.)
  • 107. Using Microarrays (cont'd)  Green: expressed only in the control  Red: expressed only in the experimental cell  Yellow: equally expressed in both samples  Black: NOT expressed in either control or experimental cells
  • 108.–109. Microarray Data  Microarray data are usually transformed into an intensity matrix (below)  The intensity matrix allows biologists to make correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related  This is where clustering comes into play. Each entry is the intensity (expression level) of a gene at the measured time:

              Time X   Time Y   Time Z
    Gene 1      10        8       10
    Gene 2      10        0        9
    Gene 3       4      8.6        3
    Gene 4       7        8        3
    Gene 5       1        2        3
  • 110. Clustering of Microarray Data  Plot each datum as a point in N-dimensional space  Make a distance matrix for the distance between every two gene points in the N-dimensional space  Genes with a small distance share the same expression characteristics and might be functionally related or similar  Clustering reveals groups of functionally related genes
  • 111. Hierarchical clustering – Step 1: transform the genes × experiments matrix into a genes × genes distance matrix (e.g. for genes A, B, and C, the pairwise distances d(A,B), d(A,C), d(B,C)). Step 2: cluster the genes based on the distance matrix and draw a dendrogram, merging until a single node remains.
• 112. Data and distance matrix  Data (expression of genes A–E measured for two patients):

                 Gene A   Gene B   Gene C   Gene D   Gene E
    Patient 1       90      190       90      200      150
    Patient 2      190      390      110      400      200

 Distance matrix (Euclidean distance between each pair of genes):

    Genes     A       B       C       D       E
    A        0.0
    B      223.6     0.0
    C       80.0   297.3     0.0
    D      237.1    14.1   310.2     0.0
    E       60.8   194.2   108.2   206.2     0.0
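The distance matrix above can be reproduced directly: each gene is treated as a point in patient-space, and the entries are the Euclidean distances between gene profiles. A minimal sketch with NumPy/SciPy:

    # Reproduce the gene-to-gene Euclidean distance matrix from the slide.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Rows = genes A..E, columns = patients 1 and 2 (values from the slide).
    genes = np.array([
        [ 90, 190],   # A
        [190, 390],   # B
        [ 90, 110],   # C
        [200, 400],   # D
        [150, 200],   # E
    ])

    D = squareform(pdist(genes, metric="euclidean"))
    print(np.round(D, 1))   # e.g. d(A,B) = 223.6, d(A,C) = 80.0, d(B,D) = 14.1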
• 113. Hierarchical clustering (continued)  Initial distance matrix for five genes:

          G1   G2   G3   G4   G5
    G1     0    2    6   10    9
    G2          0    5    9    8
    G3               0    4    5
    G4                    0    3
    G5                         0

 Merge the closest pair, G1 and G2 (distance 2), into cluster G(12); the updated distances below correspond to complete linkage, where the distance between clusters is the largest pairwise distance:

           G(12)   G3   G4   G5
    G(12)     0     6   10    9
    G3              0    4    5
    G4                   0    3
    G5                        0

 Merge G4 and G5 (distance 3) into G(45):

           G(12)   G3   G(45)
    G(12)     0     6     10
    G3              0      5
    G(45)                  0

 Agglomeration stages:

    Stage   Groups
    P5      [1], [2], [3], [4], [5]
    P4      [1 2], [3], [4], [5]
    P3      [1 2], [3], [4 5]
    P2      [1 2], [3 4 5]
    P1      [1 2 3 4 5]
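The merge sequence above can be checked with SciPy; a minimal sketch using the same distances and complete linkage:

    # Reproduce the G1..G5 complete-linkage merge sequence.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Condensed upper-triangle distances: (G1,G2)=2, (G1,G3)=6, (G1,G4)=10,
    # (G1,G5)=9, (G2,G3)=5, (G2,G4)=9, (G2,G5)=8, (G3,G4)=4, (G3,G5)=5, (G4,G5)=3
    d = np.array([2, 6, 10, 9, 5, 9, 8, 4, 5, 3], dtype=float)

    Z = linkage(d, method="complete")
    print(Z)   # merges: {G1,G2} at 2, {G4,G5} at 3, {G3,G4,G5} at 5, all at 10
    # dendrogram(Z) draws the tree down to the single root node.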
• 114. Clustering of Microarray Data (cont'd)  (Figure: the resulting gene clusters.)
• 121.  Factor analysis is an interdependence technique that analyzes the correlations among a large number of variables in order to summarize the information with a minimum loss of information.  In factor analysis, we group variables by their correlations, such that variables in a group (factor) have high correlations with each other.
• 122. • Research problem ▫ Is the analysis exploratory or confirmatory? ▫ Select objectives:  Data summarization  Data reduction • Confirmatory ▫ Proceed to structural equation modeling • Exploratory ▫ Select the type of factor analysis: what is being grouped – variables or cases? • Cases ▫ Q-type factor analysis or cluster analysis
• 123. • Variables ▫ R-type factor analysis • Research design ▫ What variables are included? ▫ How are the variables measured? ▫ What is the desired sample size? • Assumptions ▫ Statistical considerations of normality, linearity, and homoscedasticity ▫ Homogeneity of the sample ▫ Conceptual linkages
  • 124. • Selecting a Factor Method ▫ Is the total variance or only common variance analyzed? • Total variance ▫ Extract factors with component analysis • Common variance ▫ Extract factors with common factor analysis • Specifying the Factor Matrix ▫ Determine the number of factors to be retained
• 125. • Selecting a Rotational Method ▫ Should the factors be correlated (oblique) or uncorrelated (orthogonal)? • Orthogonal Methods ▫ VARIMAX ▫ EQUIMAX ▫ QUARTIMAX • Oblique Methods ▫ Oblimin ▫ Promax ▫ Orthoblique • Interpreting the Rotated Factor Matrix ▫ Can significant loadings be found? ▫ Can factors be named? ▫ Are communalities sufficient?  If no, return to selecting a factor method  If yes, go to factor model respecification
• 126. • Factor model respecification ▫ Were any variables deleted? ▫ Do you want to change the number of factors? ▫ Do you want another type of rotation?  If yes, return to selecting a factor method  If no, go to validation of the factor matrix • Validation of the Factor Matrix ▫ Split/multiple samples ▫ Separate analysis for subgroups ▫ Identify influential cases
  • 127. • Additional Uses ▫ Selection of Surrogate Variables ▫ Computation of Factor Scores ▫ Creation of Summated Scales
• 128. • Factor – summarizes the original set of observed variables • Factor loadings – the correlation between the original variables and the factors • Squared factor loadings – the percentage of the variance in an original variable that is explained by a factor
• 129. Communality • In factor analysis, a measure of the percentage of a variable's variation that is explained by the factors. • A relatively high communality indicates that a variable has much in common with the other variables taken as a group.
• 130. Specifying the Unit of Analysis • First select the unit of analysis for factor analysis: ▫ Variables, or ▫ Respondents • Factor analysis may be applied to a correlation matrix of the variables. o R factor analysis – the common type of factor analysis, applied to variables • Factor analysis may also be applied to a correlation matrix of the individual respondents based on their characteristics. o Q factor analysis – applied to respondents o Q factor analysis is not utilized frequently because it is difficult to calculate. o Instead, some type of cluster analysis is usually used to group individual respondents.
  • 131. Data summarization • Explain the data in a smaller number of concepts that equally represent the original set of variables.
• 132. Variable selection • Whether factor analysis is used for data reduction or summarization, the researcher should consider the conceptual basis of the variables. • For example, in assessing the dimensions of store image, if no questions on store personnel were included, factor analysis would not be able to identify this dimension.
• 133. • Factor analysis is subject to the "garbage in, garbage out" phenomenon. • If the researcher indiscriminately includes a large number of variables and hopes that factor analysis will "figure it out," then there is a high possibility of poor results.
  • 134. Designing a factor analysis • Factor analysis involves three basic decisions: ▫ Correlation among variables or respondents ▫ Variables selection and measurement issues ▫ Sample size
• 135. Correlations among variables or respondents • There are two forms of factor analysis; both utilize a correlation matrix as the basic data input. ▫ R-type – uses a traditional correlation matrix of the variables as input. ▫ Q-type – produces a factor matrix that identifies similar individuals.  Q-type factor analysis is different from cluster analysis: Q-type factor analysis forms groupings based on the intercorrelations between the respondents, whereas cluster analysis forms groupings based on a distance-based similarity measure.
• 136. Variable selection and measurement issues • Two specific questions must be answered: ▫ What type of variables can be used in factor analysis? ▫ How many variables should be included? • Correlations are easy to compute for metric variables; nonmetric variables are more problematic. • If dummy variables (coded 0–1) are defined to represent categories of nonmetric variables, correlations can then be computed, as sketched below. • Boolean factor analysis is more appropriate if all the variables are dummy variables. • If a study is being designed to reveal factor structure, strive to have at least five variables for each proposed factor.
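A minimal sketch of dummy-coding a nonmetric variable so it can enter a correlation matrix, assuming pandas is available; the column name "region" and its categories are hypothetical:

    # Dummy-code a nonmetric (categorical) variable as 0-1 columns.
    import pandas as pd

    df = pd.DataFrame({"region": ["north", "south", "east", "north"]})
    dummies = pd.get_dummies(df["region"], prefix="region", dtype=int)
    print(dummies)   # one 0-1 column per category; correlations can now be computed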
• 137. Sample size • For sample size: ▫ The sample must have more observations than variables. ▫ The minimum absolute sample size should be 50 observations. • Maximize the number of observations per variable, with a minimum of 5 and preferably at least 10 observations per variable.
• 138. Need for factor analysis • The difficulties of having too many independent variables when predicting the response variable are: ▫ Increased computational time to reach a solution ▫ Increased time for data collection ▫ Too much expenditure on data collection ▫ Presence of redundant independent variables ▫ Difficulty in making inferences • These can be avoided by using factor analysis
• 139. • Factor analysis aims at grouping the original input variables into factors that underlie the input variables • Initially, the total number of factors equals the total number of input variables • After performing factor analysis, the number of factors in the study can be reduced by dropping the insignificant factors based on certain criteria
• 140. Objective of factor analysis • The main objective of factor analysis is to summarize a large number of variables into a smaller number of factors that represent the basic dimensions underlying the data. • Factor analysis is used to uncover the latent structure (dimensions) of a set of variables. • It reduces attribute space from a larger number of variables to a smaller number of factors and as such is a "nondependent" procedure (that is, it does not assume a dependent variable is specified).
  • 141. Assumptions • Factor analysis is designed for interval data, although it can also be used for ordinal data • The variables used in factor analysis should be linearly related to each other. This can be checked by looking at scatter plots of pairs of variables. • Obviously the variables must also be at least moderately correlated to each other, otherwise the number of factors will be almost the same as the number of original variables, which means that carrying out a factor analysis would be pointless.
• 142. Method of determining the appropriateness of factor analysis • If visual inspection reveals few correlations greater than 0.30, factor analysis is probably inappropriate. • The correlations among variables can also be analyzed by computing the partial correlations among variables. • If partial correlations are high, factor analysis is inappropriate. • Partial correlations should be small, because each variable should be explained by the variables loading on the factors.
• 143. • Bartlett test of sphericity: ▫ A statistical test for the presence of correlations among the variables. ▫ A statistically significant Bartlett's test of sphericity (sig < .05) indicates that sufficient correlations exist among the variables to proceed. • Measure of sampling adequacy (MSA): o The MSA value must exceed 0.50 for both the overall test and each individual variable. o Variables with values less than 0.50 should be omitted from the factor analysis one at a time, the smallest being omitted each time. • The MSA increases as: o The sample size increases o The average correlations increase o The number of variables increases o The number of factors decreases
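A minimal sketch of both checks, assuming the third-party factor_analyzer package (the function names below belong to that package, not to the base scientific stack); the data matrix X is a placeholder:

    # Bartlett's test of sphericity and the MSA (KMO) check (sketch).
    import numpy as np
    from factor_analyzer.factor_analyzer import (
        calculate_bartlett_sphericity, calculate_kmo)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))   # placeholder: rows = respondents, cols = variables

    chi2, p = calculate_bartlett_sphericity(X)
    print("Bartlett chi-square:", chi2, "p-value:", p)   # proceed if p < .05

    kmo_per_variable, kmo_overall = calculate_kmo(X)
    print("Overall MSA:", kmo_overall)   # overall and per-variable MSA should exceed .50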
• 144. Selecting a factor extraction method • Before selecting the method of factor extraction, one must have some understanding of the variance of a variable and how it is divided or partitioned. • For the purposes of factor analysis, it is important to understand how much of a variable's variance is shared with other variables.
• 145. • The total variance of any variable can be divided into three types: ▫ Common variance: variance in a variable that is shared with all other variables in the analysis. ▫ Specific variance (unique variance): variance associated with only a specific variable; this variance cannot be explained by the correlations with the other variables. ▫ Error variance: variance that likewise cannot be explained by correlations with other variables, arising from unreliable measurement.  As a variable becomes more highly correlated with one or more other variables, the common variance (communality) increases.  If unreliable measures or other sources of error variance are introduced, the common variance is reduced.
• 146. Factor analysis vs. principal component analysis

 Factor analysis:
 • Analyzes only the variance shared among the variables (common variance, without error or unique variance).
 • Adjusts the diagonals of the correlation matrix for the unique factors.
 • The observed variables in FA are linear combinations of the underlying and unique factors.
 • FA's underlying constructs can be labeled and readily interpreted, given an accurate model specification.

 Principal component analysis:
 • Analyzes total variance.
 • Inserts 1's on the diagonals of the correlation matrix.
 • The component scores in PCA represent linear combinations of the observed variables weighted by eigenvectors.
 • PCA components do not represent underlying constructs.

 Both models yield similar results if the number of variables exceeds 30 or the communalities exceed 0.60.
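A minimal sketch of the practical difference, using scikit-learn's PCA (total variance) and FactorAnalysis (common variance with a unique-variance term); the data matrix X is a placeholder:

    # PCA (total variance) vs. common factor analysis (shared variance) side by side.
    import numpy as np
    from sklearn.decomposition import PCA, FactorAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))          # placeholder data matrix

    pca = PCA(n_components=2).fit(X)
    fa  = FactorAnalysis(n_components=2).fit(X)

    print("PCA component weights:\n", np.round(pca.components_, 2))
    print("FA loadings:\n", np.round(fa.components_, 2))
    # FA additionally estimates a unique (noise) variance for each variable:
    print("FA unique variances:", np.round(fa.noise_variance_, 2))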
• 147. Number of factors to extract • Any decision on the number of factors to be retained should be based on several considerations: ▫ Factors with eigenvalues greater than 1.0 ▫ A predetermined number of factors based on research objectives and prior research ▫ Enough factors to meet a specified percentage of variance explained, usually 60% or higher ▫ Factors shown by the scree test to have substantial amounts of common variance
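A minimal sketch of the eigenvalue (Kaiser) criterion and the data behind a scree plot, using only NumPy; the data matrix X is a placeholder:

    # Eigenvalues of the correlation matrix: Kaiser criterion and scree data.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))               # placeholder data matrix

    R = np.corrcoef(X, rowvar=False)            # variables in columns
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

    print("Eigenvalues (scree data):", np.round(eigvals, 2))
    print("Factors with eigenvalue > 1:", int(np.sum(eigvals > 1.0)))
    print("Cumulative % variance:", np.round(100 * np.cumsum(eigvals) / eigvals.sum(), 1))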
• 149. Interpreting the factors • The three processes of factor interpretation: ▫ Estimate the factor matrix  First, the unrotated factor matrix is computed, containing the factor loadings for each variable on each factor ▫ Factor rotation ▫ Factor interpretation and respecification
• 150. Rotating factors • When the factors are extracted, factor loadings are obtained. Factor loadings are the correlations of each variable with the factor. When the factors are rotated, the variance is redistributed so that the factor loading pattern and the percentage of variance for each factor are different. • The objective of rotating is to redistribute the variance from earlier factors to later ones in order to achieve a simpler, theoretically more meaningful factor pattern and make the result easier to interpret • Two types of factor rotation: 1. Orthogonal factor rotation 2. Oblique factor rotation
• 151. Orthogonal factor rotation • In orthogonal rotation, the axes are maintained at right angles. The objective of all methods of rotation is to simplify the rows and columns of the factor matrix. • Simplifying the rows means making as many values in each row as close to zero as possible. • Simplifying the columns means making as many values in each column as close to zero as possible. • There are three orthogonal rotation methods: 1) Quartimax, 2) Varimax, 3) Equimax.
• 152. Orthogonal factor rotation  (Figure: variables V1–V5 plotted against unrotated factors I and II; the rotated factors I and II keep the axes at right angles while passing closer to the variable clusters.)
• 153. 1. The quartimax rotation simplifies the rows of a factor matrix, i.e. it focuses on rotating the initial factors so that a variable loads high on one factor and as low as possible on all other factors. 2. The varimax rotation simplifies the columns of the factor matrix. With this approach, the maximum possible simplification is reached if there are only 1's and 0's in a column. 3. The equimax rotation is a compromise between quartimax and varimax. In practice, the first two are the most common ones to apply.
• 154. Oblique rotation methods • Oblique rotations are similar to orthogonal rotations, but they allow correlated factors instead of maintaining independence between the rotated factors. • In oblique rotation, the axes need not be maintained at right angles. • Oblique rotation represents the clustering of variables more accurately. • There are three oblique rotation methods: ▫ Oblimin ▫ Promax ▫ Orthoblique
• 155. Oblique factor rotation  (Figure: variables V1–V5 plotted against unrotated factors I and II; the orthogonal rotated axes remain at right angles, while the oblique rotation axes pass through the variable clusters without that constraint.)
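A minimal sketch contrasting an orthogonal (varimax) and an oblique (promax) rotation, again assuming the third-party factor_analyzer package; the data matrix X is a placeholder:

    # Orthogonal (varimax) vs. oblique (promax) rotation of a 2-factor solution.
    import numpy as np
    from factor_analyzer import FactorAnalyzer

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))                      # placeholder data matrix

    fa_varimax = FactorAnalyzer(n_factors=2, rotation="varimax").fit(X)
    fa_promax  = FactorAnalyzer(n_factors=2, rotation="promax").fit(X)

    print("Varimax loadings:\n", np.round(fa_varimax.loadings_, 2))
    print("Promax loadings:\n",  np.round(fa_promax.loadings_, 2))
    # After an oblique rotation, the fitted object also carries the factor
    # correlations (phi_) that varimax, by construction, forces to zero.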
• 156. Assessing factor analysis • In interpreting factors, a decision must be made regarding which factor loadings are worth consideration and attention. • Loadings exceeding 0.70 are considered indicative of well-defined structure and are the goal of any factor analysis.
• 157. Interpreting a factor matrix • Step 1: Examine the factor matrix of loadings. ▫ The factor loading matrix contains the factor loading of each variable on each factor. ▫ Rotated loadings are usually used in factor interpretation unless data reduction is the sole objective. ▫ If an oblique rotation has been used, two matrices of factor loadings are provided:  Factor pattern matrix  Factor structure matrix
• 158.  Factor pattern matrix  Represents the unique contribution of each variable to the factor.  Factor structure matrix  Simple correlations between variables and factors; these loadings contain both the unique variance between variables and factors and the correlation among factors.
• 159. Validation of factor analysis • Validation assesses the degree of generalizability of the results to the population and the potential influence of individual cases on the overall results. • Use of a confirmatory perspective: ▫ The most direct method of validating the results is to adopt a confirmatory perspective. ▫ Assess the replicability of the results either with a split sample of the original data set or with a separate sample.
• 160. Assessing factor structure stability • Factor stability depends primarily on the sample size and on the number of cases per variable. • Comparison of the two resulting factor matrices provides an assessment of the robustness of the solution.
• 161. Detecting influential observations • Another issue in the validation of factor analysis is the detection of influential observations. • Estimate the model with and without the observations identified as outliers to assess their impact on the results.
• 162. Additional uses of factor analysis results • If the objective is ▫ To identify logical combinations of variables and better understand the interrelationships among variables, then factor interpretation will be enough. ▫ To identify appropriate variables for subsequent application in other statistical techniques, then some form of data reduction will be used. ▫ There are three options for data reduction:  Summated scale  Surrogate variable  Factor score
• 163. Summated Scales • One of the common uses of factor analysis is the formation of summated scales, where we add the scores on all the variables loading on a component to create the score for the component. • To verify that the variables for a component are measuring similar entities that are legitimate to add together, we compute Cronbach's alpha, as sketched below. • If Cronbach's alpha is 0.70 or greater (0.60 or greater for exploratory research), we have support for the internal consistency of the items, justifying their use in a summated scale.
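A minimal sketch of Cronbach's alpha from its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scale); the item data below are placeholders:

    # Cronbach's alpha for a set of items intended for a summated scale.
    import numpy as np

    def cronbach_alpha(items):
        """items: 2-D array, rows = respondents, columns = items on the scale."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)        # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed scale
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(0)
    base = rng.normal(size=(100, 1))
    items = base + 0.5 * rng.normal(size=(100, 4))   # four correlated placeholder items
    print("Cronbach's alpha:", round(cronbach_alpha(items), 3))   # >= .70 supports a scale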
• 164. Surrogate variable • One option is to examine the factor matrix and select the variable with the highest factor loading on each factor to act as a surrogate variable for that factor. • The selection process becomes more difficult when two or more variables have loadings that are significant and close to each other. • Disadvantages of selecting a single surrogate variable: ▫ It does not address the issue of measurement error. ▫ It runs the risk of potentially misleading results by selecting only a single variable to represent a more complex outcome. • For these reasons, calculating a summated scale or factor scores is usually preferable to a surrogate variable.
• 165. Factor Score • A number that represents each observation's calculated value on each factor in a factor analysis. • At the initial stage, the respondents assign scores to the variables; after performing factor analysis, each factor assigns a score to each respondent. Such scores are called respondent factor scores. • Factor scores are standardized to have a mean of 0 and a standard deviation of 1.
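A minimal sketch of obtaining respondent factor scores, again assuming the third-party factor_analyzer package; transform() returns one score per respondent per factor:

    # Respondent factor scores from a fitted factor model (sketch).
    import numpy as np
    from factor_analyzer import FactorAnalyzer

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))          # placeholder data matrix

    fa = FactorAnalyzer(n_factors=2, rotation="varimax").fit(X)
    scores = fa.transform(X)               # rows = respondents, columns = factors

    print("Score means (centered at 0):", np.round(scores.mean(axis=0), 3))
    # The exact scaling to unit standard deviation depends on the
    # score-estimation method the implementation uses.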