



Published in: Technology, Business


  1. 1. Multivariate Analysis Analysis of multiple variables in a single relationship or set of relationships. 
  2. 2. Some basic concepts of multivariate analysis The Variate  Measurement scales  Measurement error and multivariate measurement.  Statistical significance Vs Statistical power 
  3. 3. 1. The Variate • A variate, also called a linear combination, is a linear combination of variables with empirically determined weights. • Variate value = w1x1 + w2x2 + ... + wnxn, where x1, x2, ..., xn are the observed variables and w1, w2, ..., wn are the weights. • The variables are specified by the researcher. • The weights are determined by the multivariate technique.
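The variate formula can be illustrated with a minimal sketch. The function name `variate_value` and the weights and observed values are hypothetical, chosen only for illustration; in practice the weights come out of the estimation procedure.

```python
def variate_value(weights, observed):
    """Return w1*x1 + w2*x2 + ... + wn*xn for one observation."""
    if len(weights) != len(observed):
        raise ValueError("weights and observed values must have the same length")
    return sum(w * x for w, x in zip(weights, observed))


# Hypothetical weights and observed values (not taken from the slides).
weights = [0.5, -1.0, 2.0]
observed = [4.0, 3.0, 1.5]
value = variate_value(weights, observed)
print(value)
```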
  4. 4. 2.Measurement Scales Data analysis involves the identification and measurement of variation in a set of variables.  The researcher cannot identify variation unless it can be measured.  Measurement is important for representing the concept and selection of appropriate multivariate method of analysis. 
  5. 5. Measurement scales can be classified into two categories  Non metric ( Qualitative)  Metric (Quantitative)
  6. 6. Non metric measurement scales   Also called qualitative data. Measures that describe by indicating the presence or absence of characteristic or property are called non metric data. For e.g.: If a person is male, he cannot be female. An “ amount” of gender is not possible, just state of being male or female. Qualitative measurements can be made with either a nominal or an ordinal scale.
  7. 7. Nominal Scales A nominal scale assigns numbers to identify subjects or objects.  Nominal scales also known as categorical scales. For e.g. In representing gender the researcher might assign numbers to each category (i.e) Assign 2 for females Assign 1 for males. 
  8. 8. Ordinal Scales • In ordinal scales, variables can be ordered or ranked. • Every subject or object can be compared with another in terms of a "greater than" or "less than" relationship. • They provide only the order of the values, not a measure of the actual amount. For e.g. opinion scales that go from "most important" to "least important" or "strongly agree" to "strongly disagree".
  9. 9.  Using ordinal scale we cannot find the amount of the differences between products. (i.e.) we cannot answer the question whether the difference between Product A and B is greater than the difference between Product B and C.
  10. 10. Metric Measurement Scales • Metric data are also called quantitative data. These measurements identify subjects not only on the possession of an attribute but also by the amount or degree to which the subject may be characterized by the attribute. For e.g. age, height. • The two metric measurement scales are ▫ Interval scales ▫ Ratio scales
  11. 11. Interval scales • Interval scales use constant units of measurement but an arbitrary zero point, so differences between values are meaningful while ratios are not. For e.g. temperature in Celsius or Fahrenheit. • Ratio scales, by contrast, have an absolute zero point, so values can be divided and compared as ratios. For e.g. income, weight, height and age. • With either metric scale we can answer the question whether the difference between Product A and B is greater than the difference between Product B and C.
  13. 13. Measurement Error  Not measuring the “true” variable values accurately due to the inappropriate response scales, data entry error, or respondent errors.  For e.g. 1. Imposing 7-point rating scales for attribute measurement when the researcher knows the respondents can accurately respond only to a 3- point rating .  2. Responses as to household income may be reasonably accurate but rarely totally precise. All variables used in multivariate techniques must be assumed to have some degree of measurement error.
  14. 14. Validity • Validity: the extent to which a measure correctly represents the concept of study, i.e. the degree to which it is free from any systematic or nonrandom error. • Validity relates to what should be measured; reliability, by contrast, relates to how it is measured.
  15. 15. Reliability Reliability is the degree to which the observed variable measures the “true” value and is “error free”.  More reliable measures will show greater consistency than less reliable measures.  Choose the variable with the higher reliability.  Validity is concerned with how well the concept is defined by the measures, whereas reliability relates to the consistency of the measures. 
  16. 16. Multivariate measurement  Use of two or more variables as indicators(i.e. single variable used in conjunction with one or more other variables to form a composite measure) of a single composite measure.  For e.g. A personality test may provide the answers to a series of individual questions, which are then combined to form a single score representing the personality trait.
  17. 17. Statistical Significance • All multivariate techniques are based on statistical inference about a population from a randomly drawn sample of that population. • When interpreting statistical inferences, the researcher must specify the acceptable level of statistical error. • A 5% or 1% level of significance means the researcher accepts a 5% or 1% probability of rejecting the null hypothesis when it is actually true, i.e. of treating a chance result as real.
  18. 18. Types of statistical error • There are two types of errors: Type 1 error and Type 2 error.
      Decision | H0 is true | H0 is false
      Accept (fail to reject) H0 | 1-α | β (Type 2 error)
      Reject H0 | α (Type 1 error) | 1-β (Power)
  19. 19.  Type 1 error is the probability of rejecting the null hypothesis when it is true. It is also known as producer’s risk. Probability of type 1 error is alpha(α).  Type 2 error is the probability of failing to reject the null hypothesis when it is false. It is also known as consumer’s risk. Probability of type 2 error is beta(β).
  20. 20. Statistical power  Power is the probability of correctly rejecting the null hypothesis when it is false.  Power of statistical inference test is 1-β  Increased sample sizes produce greater power of the statistical test.  Researchers should always design the study to achieve a power level of 0.80 at the desired significance level.
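The link between sample size, significance level, and power can be sketched numerically. The example below assumes a two-sided one-sample z-test with known standard deviation, which is a simplification not stated in the slides; `z_test_power` and `min_sample_size` are hypothetical helper names, and `effect_size` is the standardized mean difference.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided one-sample z-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5
    # Probability that the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


def min_sample_size(effect_size, alpha=0.05, target_power=0.80):
    """Smallest n whose power reaches the target (simple linear search)."""
    n = 2
    while z_test_power(effect_size, n, alpha) < target_power:
        n += 1
    return n


# Power grows with sample size for a fixed effect size and alpha.
for n in (10, 20, 40, 80):
    print(n, round(z_test_power(0.5, n), 3))
print("n needed for power 0.80:", min_sample_size(0.5))
```

Under these assumptions, a smaller effect size or a stricter alpha both raise the sample size needed to reach a power of 0.80.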
  21. 21. What type of relationship is being examined? Dependence • How many variables are being predicted? One dependent variable in a single relationship. • What is the measurement scale of the dependent variable? ▫ Metric: 1. Multiple regression 2. Conjoint analysis ▫ Non metric: 1. Multiple discriminant analysis 2. Linear probability models • (Interdependence, where the structure of relationships is among cases/respondents: Cluster analysis.)
  22. 22. What type of relationship is being examined? Dependence • How many variables are being predicted? ▫ Multiple relationships of dependent and independent variables: Structural equation modeling ▫ Several dependent variables in a single relationship: What is the measurement scale of the dependent variable? • Dependent variable metric: What is the measurement scale of the predictor variable? ▫ Metric: Canonical correlation analysis ▫ Non metric: Multivariate analysis of variance • Dependent variable non metric: Canonical correlation analysis with dummy variables
  23. 23. What type of relationship is being examined? Interdependence • Is the structure of relationships among: ▫ Variables: Factor analysis, Confirmatory factor analysis ▫ Objects: How are the attributes measured? Metric: Multidimensional scaling. Non metric: Correspondence analysis.
  24. 24. Discriminant analysis • Discriminant analysis is used when the dependent variable is a non metric variable and the independent variables are metric variables. • Total sample can be divided into groups based on a qualitative dependent variable.
  25. 25. Discriminant Analysis • Discriminant analysis also refers to a wider family of techniques ▫ Still for discrete response, continuous predictors ▫ Produces discriminant functions that classify observations into groups  These can be linear or quadratic functions  Can also be based on non-parametric techniques
  26. 26. Objectives • To understand group differences and correctly classify objects into groups or classes. • It is used, for example, to distinguish innovators from non innovators according to their demographic and psychographic profiles. For e.g. Distinguishing males from females, good credit from poor credit.
  27. 27. Application of discriminant analysis • Stage 1, Research problem. Select objectives: evaluate group differences on a multivariate profile; classify observations into groups; identify dimensions of discrimination between groups. • Stage 2, Research design issues: selection of independent variables; sample size considerations; creation of analysis and holdout samples. • Stage 3, Assumptions: normality of independent variables; linearity of relationships; lack of multicollinearity among independent variables; equal dispersion matrices.
  28. 28. Stage 4 • Estimation of the discriminant functions: simultaneous or stepwise estimation; significance of discriminant functions. • Assess predictive accuracy with classification matrices: determine optimal cutting score; specify criterion for assessing hit ratio; statistical significance of predictive accuracy.
  29. 29. (From stage 5) Stage 6 • Validation of discriminant results: split-sample or cross-validation; profiling group differences.
  30. 30. Stage 5 • Interpretation of the discriminant functions: How many functions will be interpreted? ▫ One: evaluation of the single function using discriminant weights, discriminant loadings, and partial F values. ▫ Two or more: evaluation of separate functions using discriminant weights, discriminant loadings, and partial F values; evaluation of combined functions using rotation of functions, potency index, graphical display of group centroids, and graphical display of loadings.
  31. 31. Stage 2: Selecting Dependent And Independent Variables • The researcher must first specify which variables are to be independent (must be metric) and which variable is to be the dependent variable (must be non metric). • The groups of the dependent variable must be mutually exclusive and exhaustive. • The dependent variable categories should be distinct and unique on the independent variables. Otherwise the analysis will not be able to uniquely profile each group, resulting in poor explanation and classification.
  32. 32. Converting metric variables. • In some situations the dependent variable is not truly categorical. We may use an ordinal or interval measurement as a categorical dependent variable by creating artificial groups. • Consider using extreme groups to maximize the group differences. • This method is called the polar extremes approach.
  33. 33. The Independent Variables • Independent variables are selected in two ways ▫ Identifying variables either from previous research or from the theoretical model. ▫ Utilizing the researcher’s knowledge. • In both instances, independent variables must identify differences between at least two groups of the dependent variable.
  34. 34. Sample Size • Overall sample size ▫ Have 20 cases per independent variable, with a minimum recommended level of 5 observations per variable. • Sample size per category ▫ The smallest group size of a category must exceed the number of independent variables. ▫ Wide variations in the groups' sizes will impact the estimation of the discriminant function and the classification of observations. • Maintain a sufficient sample size both overall and for each group.
  35. 35. Division of the sample • In discriminant analysis the sample is usually divided into two sub samples ▫ One (analysis sample) for estimation of the discriminant function ▫ Another (holdout sample) for validation purposes. • The sample may be divided 50-50, 60-40 or 75-25 depending on the overall sample size.
  36. 36. • If the categorical groups are equally represented in the total sample, then the groups should also be of approximately equal size in the analysis and holdout samples (i.e. if the sample consists of 50 males and 50 females, the analysis and holdout samples would each have 25 males and 25 females). • If the original groups are unequal, the group sizes in the analysis and holdout samples should be proportionate to the total sample distribution (i.e. if the sample contained 70 females and 30 males, then the analysis and holdout samples would each consist of 35 females and 15 males).
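The proportionate division into analysis and holdout samples can be sketched as a stratified random split. The function name `stratified_split`, the fixed random seed, and the group labels are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, analysis_fraction=0.5, seed=0):
    """Return (analysis_indices, holdout_indices), keeping group proportions."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, label in enumerate(labels):
        by_group[label].append(index)

    analysis, holdout = [], []
    for members in by_group.values():
        rng.shuffle(members)  # random assignment within each group
        cut = round(len(members) * analysis_fraction)
        analysis.extend(members[:cut])
        holdout.extend(members[cut:])
    return sorted(analysis), sorted(holdout)


# The 70 females / 30 males example, split 50-50.
labels = ["F"] * 70 + ["M"] * 30
analysis, holdout = stratified_split(labels, analysis_fraction=0.5)
print(sum(labels[i] == "F" for i in analysis), sum(labels[i] == "M" for i in analysis))
```

Changing `analysis_fraction` to 0.6 or 0.75 gives the 60-40 and 75-25 divisions while keeping each group's share the same in both sub samples.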
  37. 37. Stage 3: Assumptions • The most important assumption is the equality of the covariance matrices, which affects both estimation and classification. IMPACT ON ESTIMATION: • If the sample sizes are small and the covariance matrices are unequal, the estimation process is adversely affected. • Data that do not follow the normality assumption can cause problems in the estimation.
  38. 38. Impact On Classification • Unequal covariance matrices affect the classification process. • The effect can be minimized by increasing the sample size and also by using the group specific covariance matrices for classification purposes. • Multicollinearity among the independent variables can reduce the estimated impact of independent variables in the derived discriminant functions, particularly if a stepwise estimation process is used.
  39. 39. Stage 4: Estimation of the Model • Deriving the discriminant function is to choose the estimation method ▫ 1. The simultaneous (direct) method ▫ 2. The stepwise method • 1. Simultaneous method: It involves computing the discriminant function so that all of the independent variables are considered at the same time.
  40. 40. 2. Stepwise Method: It involves entering the independent variables into the discriminant function one at a time on the basis of their discriminating power. Step 1: Choose the single best discriminating variable. Step 2: Pair the initial variable with each of the other independent variables, one at a time, and select the variable that best improves the discriminating power of the function in combination with the first variable. Select additional variables in a like manner. Step 3: Variables that are not useful in discriminating between the groups are eliminated. Stepwise estimation becomes less stable when the sample size declines below the recommended level of 20 observations per independent variable.
  41. 41. Statistical significance • The researcher must assess the level of significance of the discriminant function. The conventional criterion of 0.05 or beyond is generally used; if a higher level of risk is acceptable, a significance level of 0.2 or 0.3 may be fixed. Overall significance: 1. Simultaneous estimation: The measures of Wilks' lambda, Hotelling's trace, and Pillai's criterion all evaluate the statistical significance of the discriminant function. 2. Stepwise estimation: When the stepwise method is used to estimate the discriminant function, the Mahalanobis D2 and Rao's V measures are most appropriate. Mahalanobis D2 is based on squared Euclidean distance, adjusted for unequal variances.
  42. 42. Assessing overall model fit • This assessment involves three tasks ▫ Calculating discriminant Z scores for each observation ▫ Evaluating group differences on the discriminant Z scores ▫ Assessing group membership prediction accuracy
  43. 43. Calculating Discriminant Z scores • The discriminant function can be expressed with either standardized or unstandardized weights and values. • The standardized version is more useful for interpretation purpose. • The unstandardized version is easier to use in calculating the discriminant Z score.
  44. 44. Evaluating Group Differences • Group differences are evaluated by comparing the group centroids, the average discriminant Z score for all members of a group. • The differences between centroids are measured in terms of the Mahalanobis D2 measure.
  45. 45. Assessing Group membership Prediction Accuracy • Because the dependent variable is nonmetric, it is not possible to use a measure such as R2 to assess predictive accuracy. • Rather, the classification of each observation must be assessed. In doing so, several major considerations must be addressed: ▫ Rationale for developing classification matrices ▫ Cutting score calculation ▫ Construction of the classification matrices ▫ Standards for assessing classification accuracy.
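A small sketch of how discriminant Z scores and group centroids relate. The intercept, the unstandardized weights, and the four observations are hypothetical values made up for illustration; they are not estimated from any data.

```python
def discriminant_z(intercept, weights, observation):
    """Unstandardized discriminant Z score: a + w1*x1 + ... + wn*xn."""
    return intercept + sum(w * x for w, x in zip(weights, observation))


def group_centroids(z_scores, groups):
    """Average discriminant Z score of the members of each group."""
    totals, counts = {}, {}
    for z, g in zip(z_scores, groups):
        totals[g] = totals.get(g, 0.0) + z
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


# Hypothetical function and data (two independent variables, two groups).
intercept = -4.0
weights = [0.5, 1.0]
observations = [[2.0, 1.0], [4.0, 1.0], [6.0, 3.0], [8.0, 3.0]]
groups = ["A", "A", "B", "B"]

z_scores = [discriminant_z(intercept, weights, x) for x in observations]
centroids = group_centroids(z_scores, groups)
print(z_scores, centroids)
```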
  46. 46. Classification matrix • This is also called prediction matrix. • The correctly classified cases appear on diagonal because the predicted and actual group are same. • Off diagonal represents cases that have been incorrectly classified. • The sum of diagonal elements divided by number of cases represent hit ratio.
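The classification matrix and hit ratio can be computed directly from actual and predicted group labels. This is a minimal sketch; the function names and the eight example cases are hypothetical.

```python
def classification_matrix(actual, predicted, groups):
    """Rows are actual groups, columns are predicted groups."""
    matrix = {a: {p: 0 for p in groups} for a in groups}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def hit_ratio(matrix):
    """Sum of the diagonal divided by the total number of cases."""
    correct = sum(matrix[g][g] for g in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


# Hypothetical actual and predicted memberships for eight cases.
actual = ["A", "A", "A", "B", "B", "B", "B", "A"]
predicted = ["A", "A", "B", "B", "B", "A", "B", "A"]

matrix = classification_matrix(actual, predicted, groups=["A", "B"])
print(matrix)
print(hit_ratio(matrix))
```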
  47. 47. Cutting score • Criterion against which each individual's discriminant Z score is compared to determine predicted group membership. • It represents the dividing point used to classify observation into one of two groups based on their discriminant function score. • Optimal cutting score: Discriminant Z score value that best separates the groups on each discriminant function for classification purposes.
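  48. 48. Optimal Cutting Score with Equal Sample Sizes [Figure: distributions of discriminant Z scores for Group A and Group B with centroids Z̄A and Z̄B; the cutting score between them separates "Classify as A (Nonpurchaser)" from "Classify as B (Purchaser)".]
  49. 49. Optimal Cutting Score with Unequal Sample Sizes [Figure: distributions of discriminant Z scores for Group A and Group B with centroids Z̄A and Z̄B, showing both the unweighted cutting score and the optimal weighted cutting score.]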
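The slides show the two cutting scores only graphically. The sketch below uses the commonly cited textbook formulas, which are an assumption here since the deck does not state them: the midpoint of the two centroids for equal group sizes, and (N_A * Z_B + N_B * Z_A) / (N_A + N_B) for unequal group sizes. The centroids and group sizes are hypothetical, and Group A is assumed to have the lower centroid.

```python
def cutting_score_equal(z_a, z_b):
    """Midpoint of the two group centroids (equal group sizes)."""
    return (z_a + z_b) / 2


def cutting_score_weighted(z_a, z_b, n_a, n_b):
    """Size-weighted cutting score (unequal group sizes)."""
    return (n_a * z_b + n_b * z_a) / (n_a + n_b)


def classify(z, cutting_score, low_group="A", high_group="B"):
    """Assign to the low group below the cutting score, else the high group."""
    return low_group if z < cutting_score else high_group


# Hypothetical centroids and group sizes.
z_a, z_b = -1.5, 2.5
unweighted = cutting_score_equal(z_a, z_b)
weighted = cutting_score_weighted(z_a, z_b, n_a=30, n_b=10)
print(unweighted, weighted, classify(1.0, unweighted), classify(1.0, weighted))
```

With these numbers the weighted score moves toward the centroid of the smaller group, so a borderline case is more likely to be assigned to the larger group.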
  50. 50. Stage 5: Interpretation of the results Three methods of determining the relative importance of each independent variable. • Standardized discriminant weights(coefficients) • Discriminant loadings ( structure correlations) • Partial F values
  51. 51. Discriminant weights (coefficients) • Interpreting discriminant functions this way examines the sign and magnitude of the coefficient assigned to each variable in computing the discriminant functions. • Independent variables with larger coefficients contribute more to the discriminating power of the function than variables with smaller coefficients. • The interpretation of discriminant coefficients is similar to the interpretation of beta coefficients in regression analysis.
  52. 52. Discriminant loadings • They are also referred to as structure correlations. • Loadings are increasingly used as a basis for interpretation because of the deficiencies in utilizing coefficients. • Unique characteristic: Loadings can be calculated for all variables, whether they were used in the estimation of the discriminant function or not. This is particularly useful in stepwise estimation. • Loadings are more valid than coefficients for interpreting the discriminating power of independent variables because of their correlational nature. • Loadings exceeding ±.40 are considered substantive for interpretation purposes.
  53. 53. Partial F values • When the stepwise method is selected, partial F values can be used to interpret the discriminating power of the independent variables. • Larger F values indicate greater discriminatory power.
  54. 54. Validation • Discriminant loadings are the preferred method to assess the contribution of each variable to a discriminant function because they are: ▫ A standardized measure of importance (ranging from 0 to 1 in absolute value) ▫ Available for all independent variables whether used in the estimation process or not ▫ Unaffected by multicollinearity • The discriminant function must be validated either with a holdout sample or one of the "leave one out" procedures.
  55. 55. Cluster Analysis
  56. 56. Cluster Analysis  Statistical classification technique in which cases, data, or objects (events, people, things, etc .) are sub-divided into groups (clusters) such that the items in a cluster are very similar (but not identical) to one another and very different from the items in other clusters.
  57. 57. Application of Cluster analysis • Stage 1, Research problem. Select objectives: taxonomy description; data simplification; reveal relationships. Select clustering variables. • Stage 2, Research design issues: Can outliers be detected? Should the data be standardized?
  58. 58. Stage 2 (continued) • Select a similarity measure: Are the cluster variables metric or non metric? ▫ Non metric data: association measures of similarity (matching coefficients). ▫ Metric data: Is the focus on pattern or proximity? Pattern: correlation measure of similarity (correlation coefficient). Proximity: distance measures of similarity (Euclidean distance, city-block distance, Mahalanobis distance). • Standardization options: standardizing variables; standardizing by observation. • To stage 3
  59. 59. From stage 2 • Stage 3, Assumptions: Is the sample representative of the population? Is multicollinearity substantial enough to affect results? • Stage 4, Selecting a clustering algorithm: Is a hierarchical, nonhierarchical, or combination of the two methods used? ▫ Hierarchical methods, linkage methods available: single linkage, complete linkage, average linkage, Ward's method, centroid method. ▫ Nonhierarchical methods, assignment methods available: sequential threshold, parallel threshold, optimization; selecting seed points. ▫ Combination: use a hierarchical method to specify cluster seed points for a non hierarchical method. • How many clusters are formed? Examine increases in agglomeration coefficient; examine dendrogram and vertical icicle plots; conceptual considerations. • Cluster analysis re-specification: Were any observations deleted as outliers or as members of small clusters? (Yes / No)
  60. 60. From Stage 4 • Stage 5, Interpreting the clusters: examine cluster centroids; name clusters based on clustering variables. • Stage 6, Validating and profiling the clusters: validation with selected outcome variables; profiling with additional descriptive variables.
  61. 61. Governing principle Maximization of homogeneity within clusters and simultaneously Maximization of heterogeneity across clusters
  62. 62. Three Basic Questions: 1. How to measure similarity? 2. How to form clusters? (extraction method) 3. How many clusters?
  63. 63. Answers to First Two Basic Questions: 1. How to measure similarity? • Distance – squared Euclidean. 2. How to form clusters? • Hierarchical – Wards method.
  64. 64. Third Basic Question: How many clusters? 1. Run cluster; examine solutions for two, three, four, etc. clusters. 2. Select number of clusters based on "a priori" criteria, practical judgment, common sense, theoretical foundations, and statistical significance.
  65. 65. Steps in Cluster Analysis: 1. Identify the variables to be clustered. 2. Determine if clusters exist. To do so, verify the clusters are statistically different and theoretically meaningful. 3. Make an initial decision on how many clusters to use. 4. Where possible, validate clusters using an external variable. 5. Describe the characteristics of the derived clusters using demographics, psychographics, etc.
  66. 66. Objectives of cluster analysis  Goal of cluster analysis is to partition a set of object into two or more groups based on the similarity  Cluster analysis is used for  Taxonomy description: Identifying natural groups within the data.  Data simplification: The ability to analyze groups of similar observations instead of all individual observation.  Relationship identification: The simplified structure from cluster analysis portrays relationships not revealed otherwise.
  67. 67. Research Design in Cluster Analysis • Outliers. • Similarity/Distance Measures. • Standardizing the Data.
  68. 68. Outliers • In a set of numbers, a number that is much larger or much smaller than the rest of the numbers is called an outlier. • "Outliers" are values that "lie outside" the other values.
  69. 69. Similarity measure  Three different forms of similarity measures are:  Correlation Measures (require metric data)  Having widespread application, represent patterns rather than proximity.  Distance Measures (require metric data)  Best represents the concept of proximity, which is fundamental to cluster analysis.  Association Measures (require nonmetric data)  Represent the proximity of objects across a set of nonmetric variables.
  70. 70. Types of distance measures  Euclidean distance  Squared (or absolute) Euclidean distance  City – block (Manhattan) distance  Chebychev distance  Mahalanobis distance (D2 )
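  71. 71. Euclidean distance • For two points A (x1, y1) and B (x2, y2), the Euclidean distance is the length of the hypotenuse of the right triangle with sides x2 - x1 and y2 - y1: $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$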
  72. 72.  Squared (or absolute) Euclidean distance  It is the sum of the squared differences without taking the square root.  It is the distance measure for the Centroid and Ward’s methods of clustering.  City- block distance  Uses the sum of the absolute differences of the variables (i.e.) The two sides of a right triangle rather than the hypotenuse.  Simplest to calculate, but may lead to invalid clusters if the clustering variables are highly correlated.
  73. 73. • Chebychev distance ▫ The maximum absolute difference across the clustering variables. It is particularly susceptible to differences in scales across the variables. • Mahalanobis distance (D2) ▫ A generalized distance measure that accounts for the correlations among the variables, so that each variable is weighted equally; it also standardizes the variables.
  74. 74. Standardizing the data • Cluster analyses using distance measures are quite sensitive to differing scales or magnitudes among the variables. • With distance measures computed on unstandardized data, the cluster solution can change when the scale of the variables is changed.
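The distance measures listed above differ only in how they combine the per-variable differences. A minimal sketch for two observations follows; the two example points are hypothetical, and Mahalanobis distance is left out because it also needs the covariance matrix of the variables.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared differences, without the square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance (the hypotenuse)."""
    return math.sqrt(squared_euclidean(a, b))


def city_block(a, b):
    """Sum of absolute differences (the two sides of the right triangle)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebychev(a, b):
    """Largest absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))


# Two hypothetical observations measured on two variables.
a = (1.0, 2.0)
b = (4.0, 6.0)
print(euclidean(a, b), squared_euclidean(a, b), city_block(a, b), chebychev(a, b))
```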
  75. 75. Standardizing the variables  Common form of standardization is the conversion of each variable to standard scores (i.e. Z score)  By subtracting the mean and dividing by the standard deviation for each variable.  It eliminates the effects due to scale differences not only across variables, but for the same variable.  Clustering variables should be standardized whenever possible to avoid problems resulting from the use of different scale values among clustering variables.  A measure of Euclidean distance that directly incorporates a standardization procedure is the Mahalanobis distance (D2 ).
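Converting each clustering variable to Z scores can be sketched as follows. The helper name `standardize_columns` and the income and age figures are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption, since the slides do not say which form is used.

```python
from statistics import mean, stdev


def standardize_columns(rows):
    """Convert every variable (column) to Z scores: (x - mean) / sd."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]  # sample standard deviation (assumption)
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


# Hypothetical data: income in currency units and age in years.
rows = [[20000.0, 25.0], [40000.0, 35.0], [60000.0, 45.0]]
z_rows = standardize_columns(rows)
print(z_rows)
```

After standardization both variables contribute on the same scale, so income no longer dominates a distance calculation simply because its raw values are larger.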
  76. 76. Standardizing by observation  Standardizing by respondent would standardize each question not to the sample’s average but instead to that respondent’s average score.  Within case or row centering standardization can be quite effective in removing response style effects  Standardization provides a remedy to a fundamental issue in similarity measures and distance measures.
  77. 77. Cluster Analysis Assumptions: Representative Sample. • The cluster analysis is only as good as the representativeness of the sample. Therefore, all efforts should be made to ensure that the sample is representative and the results are generalizable to the population.
  78. 78. Minimal Multicollinearity. • Input variables should be examined for strong multicollinearity and, if present: ▫ Reduce the variables to equal numbers in each set of correlated measures, or ▫ Use a distance measure that compensates for the correlation, such as Mahalanobis distance.
  79. 79. Deriving Clusters and Assessing Overall Fit  With the clustering variables selected and the similarity matrix calculated, the partitioning process begins.  The researcher must:  Select the partitioning procedure used for forming clusters.  Make the decision on the number of clusters to be formed.
  80. 80. Partitioning Procedures  To maximize the differences between clusters relative to the variation within the cluster.  The most widely used procedures can be classified as  Hierarchical  Nonhierarchical
  81. 81. [Figure: classification of clustering procedures into hierarchical (agglomerative, divisive) and non-hierarchical, and into non-overlapping and overlapping.]
  82. 82. Hierarchical Clustering  Two main types of hierarchical clustering  Agglomerative:  Start with the points as individual clusters  At each step, merge the closest pair of clusters until only one cluster (or k clusters) left  Divisive:  Start with one, all-inclusive cluster  At each step, split a cluster until each cluster contains a point (or there are k clusters)  Traditional hierarchical algorithms use a similarity or distance matrix  Merge or split one cluster at a time
  83. 83. Agglomerative • Step 0: Each observation is treated as a separate cluster. [Figure: OBS 1 to OBS 6 plotted against a distance measure axis from 0,2 to 1,0, with the agglomerative and divisive directions marked.]
  84. 84. Step 1: The two observations with the smallest pairwise distance are clustered (Cluster 1). [Figure: OBS 1 to OBS 6, distance axis 0,2 to 1,0.]
  85. 85. Step 2: Two other observations with the smallest distance amongst the remaining points/clusters are clustered (Cluster 2). [Figure: OBS 1 to OBS 6, distance axis 0,2 to 1,0.]
  86. 86. Step 3: Observation 3 joins Cluster 1. [Figure: OBS 1 to OBS 6, distance axis 0,2 to 1,0.]
  87. 87. Step 4: Clusters 1 and 2 from Step 3 join into a "Supercluster". A single observation remains unclustered (outlier). [Figure: OBS 1 to OBS 6, distance axis 0,2 to 1,0.]
  88. 88. Five Most Popular Agglomerative Algorithms • Single linkage: Smallest distance from any object in one cluster to any object in the other. • Complete linkage: Largest distance between an observation in one cluster and an observation in the other. • Average linkage: Average distance between an observation in one cluster and an observation in the other. • Centroid method: Distance between the centroids of two clusters. • Ward's method: The distance between two clusters is the difference between the total within-cluster sum of squares for the two clusters separately and the within-cluster sum of squares resulting from merging the two clusters.
  89. 89. Agglomerative Algorithms • Single Linkage: Minimum distance • Complete Linkage: Maximum distance • Average Linkage: Average distance • Centroid method: Distance between centres • Ward's method: Minimization of within-cluster variance
  90. 90. Single linkage: Minimize shortest distance from cluster to point. [Figure: points A, B, C, D, E, G, H with example distances 7,0 and 8,5.]
  91. 91. Complete linkage: Minimize longest distance from cluster to point. [Figure: points A, B, C, D, E, G, H with example distances 10,5 and 9,5.]
  92. 92. Average linkage: Minimize average distance from cluster to point. [Figure: points A, B, C, D, E, G, H with example distances 8,5 and 9,0.]
  93. 93. Hierarchical Clustering: Comparison [Figure: the same six points clustered by four methods, with panels MIN (single linkage), MAX (complete linkage), Group Average, and Ward's Method.]
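The agglomerative procedure with three of the linkage rules can be sketched in a few lines. This is a deliberately naive illustration rather than an efficient implementation: it assumes Euclidean distance, omits the centroid and Ward's methods, and uses hypothetical points and function names.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices)."""
    distances = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(distances)  # smallest pairwise distance
    if linkage == "complete":
        return max(distances)  # largest pairwise distance
    if linkage == "average":
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, n_clusters, linkage="single"):
    """Start with one cluster per point; merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical data: two tight groups and one isolated observation.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (25.0, 25.0)]
result = agglomerate(points, n_clusters=3, linkage="single")
print(result)
```

On well separated data like this the three linkage rules agree; they tend to differ when clusters are elongated or when outliers sit between groups.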
  94. 94. Non Hierarchical Clustering  Non hierarchical clustering is also called K- means clustering.  The process essentially has two steps:  Specify cluster seeds:  To identify starting points for each cluster known as cluster seeds. It is selected in a random process.  Assignment  To assign each observation to one of the cluster seeds based on similarity.
  95. 95. Selecting seed points  How do we select the cluster seeds? Classified into two basic categories: Researcher specified:  The researcher provides the seed points based on external data. The two common sources of the seed points are prior research or data from another multivariate analysis.  Sample generated:  To generate the cluster seeds from the observations of the sample , either in systematic or random selection 
  96. 96. Non Hierarchical Clustering Algorithms  Sequential Threshold method - first determine a cluster center, then group all objects that are within a predetermined threshold from the center - one cluster is created at a time  Parallel Threshold method - simultaneously several cluster centers are determined, then objects that are within a predetermined threshold from the centers are grouped  Optimizing Partitioning method - first a nonhierarchical procedure is run, then objects are reassigned so as to optimize an overall criterion.
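The seed-then-assign process can be sketched as a basic K-means loop, which is one instance of an optimizing procedure because observations are reassigned until the assignment stops changing. The seed points here are researcher specified and, like the data and function names, are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, seeds, max_iterations=100):
    """Assign each point to its nearest center, then recompute the centers."""
    centers = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no observation changed cluster
        assignment = new_assignment
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers


# Hypothetical observations and researcher-specified seed points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
seeds = [(1.0, 1.0), (8.0, 8.0)]
assignment, centers = k_means(points, seeds)
print(assignment, centers)
```

Different seed points can lead to different final clusters, which is why seed selection is treated as a separate decision.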
  97. 97. Pros and Cons of Hierarchical Methods • Hierarchical clustering is the more popular approach, with Ward's method and average linkage the most commonly used. • Advantages: ▫ Simplicity ▫ Measures of similarity ▫ Speed
  98. 98. Disadvantages • Hierarchical methods can be misleading because undesirable early combinations may persist throughout the analysis and lead to artificial results. • To reduce the impact of outliers, the researcher may analyze the data several times, each time deleting problem observations or outliers.
  99. 99. Combination of Both Methods  Hierarchical technique is used to generate a complete set of cluster solutions, establish the cluster solutions, profile cluster centers to act as cluster seed points, and identify outliers.  After outliers are eliminated, remaining observation can be clustered by a non hierarchical method with the cluster centers from the hierarchical results acting as the initial seed points.
  100. 100. Decision on the number of cluster to be formed  Performing either a hierarchical or non hierarchical cluster analysis is determining the number of clusters.  Decision is critical for hierarchical techniques because the researcher must select the cluster solution to represent the data structure (called stopping rule).
  101. 101. Interpretation of the clusters  The cluster Centroid (i.e. a mean of the cluster) is particularly useful in interpretation.  It involves calculate the distinguishing characteristics of each clusters and identifying differences between clusters.  If cluster solutions fail to show large variation then the other cluster solutions should be calculate.  The cluster centroid should be assessed based on theory or practical experience.
  102. 102. Validation • Validation is essential in cluster analysis because the clusters are descriptive of structure and require additional support for their relevance. • Cross-validation: ▫ Cluster analyze separate samples ▫ Compare the cluster solutions ▫ Assess the correspondence of the results • When separate samples are not available, a common approach is to create two subsamples (by randomly splitting the sample) and then compare the two cluster solutions for consistency with respect to the number of clusters and the cluster profiles.
  103. 103. Inferring Gene Functionality • Researchers want to know the functions of new genes • Simply comparing the new gene sequences to known DNA sequences often does not give away the actual function of a gene • For 40% of sequenced genes, functionality cannot be ascertained by only comparing to sequences of other known genes • Microarrays allow biologists to infer gene function even when there is not enough evidence to infer function based on similarity alone
  104. 104. Microarray Analysis  Microarrays measure the activity (expression level) of the gene under varying conditions/time points  Expression level is estimated by measuring the amount of mRNA for that particular gene  A gene is active if it is being transcribed  More mRNA usually indicates more gene activity
  105. 105. Microarray Experiments • Analyze mRNA produced from cells in the tissue with the environmental conditions you are testing • Produce cDNA from mRNA (DNA is more stable) • Attach phosphor to cDNA to see when a particular gene is expressed • Different color phosphors are available to compare many samples at once • Hybridize cDNA over the microarray • Scan the microarray with a phosphor-illuminating laser • Illumination reveals transcribed genes • Scan the microarray multiple times for the different color phosphors
  106. 106. Using Microarrays • Track the sample over a period of time to see gene expression over time • Track two different samples under the same conditions to see the difference in gene expression • Each box represents one gene's expression over time
  107. 107. Using Microarrays (cont'd) • Green: expressed only in the control • Red: expressed only in the experimental cell • Yellow: equally expressed in both samples • Black: NOT expressed in either control or experimental cells
  108. 108. Microarray Data • Microarray data are usually transformed into an intensity matrix (below) • The intensity matrix allows biologists to find correlations between different genes (even if they are dissimilar) and to understand how gene functions might be related • Clustering comes into play
  Intensity (expression level) of gene at measured time:
            Time X   Time Y   Time Z
  Gene 1    10       8        10
  Gene 2    10       0        9
  Gene 3    4        8.6      3
  Gene 4    7        8        3
  Gene 5    1        2        3
  110. 110. Clustering of Microarray Data • Plot each datum as a point in N-dimensional space • Make a distance matrix for the distance between every two gene points in the N-dimensional space • Genes with a small distance share the same expression characteristics and might be functionally related or similar! • Clustering reveals groups of functionally related genes
  111. 111. Hierarchical clustering • Step 1: Transform the genes × experiments matrix (rows Gene A, Gene B, Gene C; columns Exp 1 to Exp 4) into a genes × genes distance matrix (zeros on the diagonal, pairwise distances to be computed). • Step 2: Cluster genes based on the distance matrix and draw a dendrogram until a single node remains.
  112. 112. Data and distance matrix
  Data (expression by patient):
  Patient   Gene A   Gene B   Gene C   Gene D   Gene E
  1         90       190      90       200      150
  2         190      390      110      400      200
  Distance matrix (genes):
        A       B       C       D       E
  A     0.0
  B     223.6   0.0
  C     80.0    297.3   0.0
  D     237.1   14.1    310.2   0.0
  E     60.8    194.2   108.2   206.2   0.0
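The distance matrix on this slide is consistent with plain Euclidean distance between the genes' two-patient expression profiles. The slide does not name the distance measure, so treat that as an inference from the numbers; the sketch below recomputes the matrix from the slide's data under that assumption.

```python
import math

# Expression values from the slide: (patient 1, patient 2) for each gene.
genes = {
    "A": (90, 190),
    "B": (190, 390),
    "C": (90, 110),
    "D": (200, 400),
    "E": (150, 200),
}

def distance_matrix(profiles):
    """Euclidean distance between every pair of profiles, rounded to one decimal."""
    names = list(profiles)
    return {
        (a, b): round(math.dist(profiles[a], profiles[b]), 1)
        for a in names
        for b in names
    }

dist = distance_matrix(genes)
for a in genes:
    print(a, [dist[(a, b)] for b in genes])
```

Genes B and D are by far the closest pair (14.1), so they would be the first to merge in a hierarchical clustering of these data.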
  113. 113. Hierarchical clustering (continued)
  Initial distance matrix:
          G1   G2   G3   G4   G5
  G1      0    2    6    10   9
  G2           0    5    9    8
  G3                0    4    5
  G4                     0    3
  G5                          0
  After merging G1 and G2:
          G(12)  G3   G4   G5
  G(12)   0      6    10   9
  G3             0    4    5
  G4                  0    3
  G5                       0
  After merging G4 and G5:
          G(12)  G3   G(45)
  G(12)   0      6    10
  G3             0    5
  G(45)               0
  Stage   Groups
  P5      [1], [2], [3], [4], [5]
  P4      [1 2], [3], [4], [5]
  P3      [1 2], [3], [4 5]
  P2      [1 2], [3 4 5]
  P1      [1 2 3 4 5]
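The merge sequence on this slide can be reproduced with a short agglomerative loop. The merged distances shown (for example 6 between G(12) and G3) equal the larger of the two original distances, which suggests complete linkage; the slide does not state the linkage rule, so that is an assumption, and the function name is hypothetical.

```python
def complete_linkage(dist, n):
    """Agglomerate items 1..n; `dist` maps frozenset({i, j}) to a distance.
    Returns the list of groupings after each stage."""
    clusters = [(i,) for i in range(1, n + 1)]
    stages = [list(clusters)]

    def cluster_distance(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(dist[frozenset((i, j))] for i in a for j in b)

    while len(clusters) > 1:
        pairs = [
            (cluster_distance(a, b), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        ]
        _, a, b = min(pairs)  # closest pair of clusters
        clusters = sorted([c for c in clusters if c not in (a, b)] + [a + b])
        stages.append(list(clusters))
    return stages

# Upper triangle of the slide's initial distance matrix (G1..G5).
upper = {
    (1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9,
    (2, 3): 5, (2, 4): 9, (2, 5): 8,
    (3, 4): 4, (3, 5): 5,
    (4, 5): 3,
}
dist = {frozenset(pair): d for pair, d in upper.items()}

stages = complete_linkage(dist, 5)
for stage in stages:
    print(stage)
```

Swapping `max` for `min` inside `cluster_distance` would give single linkage instead, which can produce a different merge order on other data.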
  114. 114. Clustering of Microarray Data (cont’d) Clusters
  115. 115. Hierarchical Clustering
  116. 116. Hierarchical Clustering: Example
  117. 117. Hierarchical Clustering: Example
  118. 118. Hierarchical Clustering: Example
  119. 119. Hierarchical Clustering: Example
  120. 120. • Factor analysis is an interdependence technique. • It analyzes the correlations among a large number of variables. • Its aim is to summarize the information with a minimum loss of information. • In factor analysis, we group variables by their correlations, such that variables in a group (factor) have high correlations with each other.
  121. 121. • Research problem ▫ Is the analysis exploratory or confirmatory? ▫ Select objectives:  Data summarization  Data reduction • Confirmatory ▫ Structural equation modeling • Exploratory ▫ Select the type of Factor Analysis What is being grouped – variables or cases? • Cases ▫ Q – type factor analysis or cluster analysis.
  122. 122. • Variables ▫ R – type factor analysis. • Research design ▫ What variables are included? ▫ How are the variables measured? ▫ What is the desired sample size? • Assumptions ▫ Statistical considerations of normality, linearity, and homoscedasticity. ▫ Homogeneity of sample ▫ Conceptual linkages
  123. 123. • Selecting a Factor Method ▫ Is the total variance or only common variance analyzed? • Total variance ▫ Extract factors with component analysis • Common variance ▫ Extract factors with common factor analysis • Specifying the Factor Matrix ▫ Determine the number of factors to be retained
  124. 124. • Selecting a Rotational Method ▫ Should the factors be correlated (oblique) or uncorrelated (orthogonal)? • Orthogonal Methods ▫ VARIMAX ▫ EQUIMAX ▫ QUARTIMAX • Oblique Methods ▫ Oblimin ▫ Promax ▫ Orthoblique • Interpreting the Rotated Factor Matrix ▫ Can significant loadings be found? ▫ Can factors be named? ▫ Are communalities sufficient? ▫ If no, return to selecting a factor method ▫ If yes, go to factor model respecification
  125. 125. • Factor model respecification ▫ Were any variables deleted? ▫ Do you want to change the number of factors? ▫ Do you want another type of rotation? ▫ If yes, return to selecting a factor method ▫ If no, go to validation of the factor matrix • Validation of the Factor Matrix ▫ Split/multiple samples ▫ Separate analysis for subgroups ▫ Identify influential cases
  126. 126. • Additional Uses ▫ Selection of Surrogate Variables ▫ Computation of Factor Scores ▫ Creation of Summated Scales
  127. 127. • Factor – summarizes the original set of observed variables. • Factor loadings – correlations between the original variables and the factors. • Squared factor loadings – the percentage of the variance in an original variable that is explained by a factor.
  128. 128. Communality • In factor analysis, a measure of the percentage of a variable's variation that is explained by the factors. • A relatively high communality indicates that a variable has much in common with the other variables taken as a group.
  129. 129. Specifying the Unit of Analysis • First select the unit of analysis for factor analysis: ▫ Variables (or) ▫ Respondents • Factor analysis is usually applied to a correlation matrix of the variables. o R factor analysis – the common type of factor analysis, for variables • Factor analysis may also be applied to a correlation matrix of the individual respondents based on their characteristics. o Q factor analysis – for respondents o Q factor analysis is not used frequently because it is difficult to calculate. o Instead, some type of cluster analysis is used to group individual respondents.
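A small sketch connecting squared loadings to communality follows. It assumes uncorrelated (orthogonal) factors, in which case a variable's communality is the sum of its squared loadings across factors. The loading values are hypothetical and chosen only for illustration.

```python
# Hypothetical loadings: one row per variable, one column per factor.
loadings = {
    "V1": (0.80, 0.10),
    "V2": (0.70, 0.30),
    "V3": (0.20, 0.60),
}

def communality(row):
    """Sum of squared loadings: the share of the variable's variance
    explained by the factors (assuming orthogonal factors)."""
    return sum(loading ** 2 for loading in row)

communalities = {name: round(communality(row), 2) for name, row in loadings.items()}
print(communalities)
```

Here V1 has the highest communality, meaning the two factors account for most of its variance, while V3 shares much less with the factor solution.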
  130. 130. Data summarization • Explain the data in a smaller number of concepts that equally represent the original set of variables.
  131. 131. Variable selection • Whether factor analysis is used for data reduction or summarization, the researcher should consider the conceptual basis of the variables. • For example, in assessing the dimensions of store image, if no questions on store workers were included, factor analysis would not be able to identify this dimension.
  132. 132. • Factor analysis is subject to the "garbage in, garbage out" phenomenon. • If the researcher includes a large number of variables and hopes that factor analysis will "figure it out," then there is a high possibility of poor results.
  133. 133. Designing a factor analysis • Factor analysis involves three basic decisions: ▫ Correlation among variables or respondents ▫ Variables selection and measurement issues ▫ Sample size
  134. 134. Correlations among variables or respondents • There are two forms of factor analysis. Both use a correlation matrix as the basic data input. ▫ R type – uses a traditional correlation matrix of the variables as input. ▫ Q type – uses a correlation matrix of the respondents, yielding a factor matrix that identifies similar individuals. • Q factor analysis is different from cluster analysis: Q-type factor analysis forms groupings based on the intercorrelations between the respondents, whereas cluster analysis forms groupings based on a distance-based similarity measure.
  135. 135. Variable selection and measurement issues • Two specific questions must be answered: ▫ What type of variables can be used in factor analysis? ▫ How many variables should be included? • Correlations are easy to compute for metric variables. • Non metric variables are more problematic. • If dummy variables (coded 0-1) are defined to represent the categories of non metric variables, correlations can be computed. • Boolean factor analysis is more appropriate if all the variables are dummy variables. • If a study is being designed to reveal factor structure, strive to have at least five variables for each proposed factor.
  136. 136. Sample size • For sample size: ▫ The sample must have more observations than variables. ▫ The minimum absolute sample size should be 50 observations. • Maximize the number of observations per variable, with a minimum of 5 and preferably at least 10 observations per variable.
  137. 137. Need for factor analysis • The difficulties in having too many independent variables when predicting the response variable are: ▫ Increased computational time to get a solution ▫ Increased time in data collection ▫ Too much expenditure in data collection ▫ Presence of redundant independent variables ▫ Difficulty in making inferences • These can be avoided by using factor analysis.
  138. 138. • Factor analysis aims at grouping the original input variables into factors which underlie the input variables. • Initially, the total number of factors equals the total number of input variables. • After factor analysis is performed, the number of factors in the study can be reduced by dropping the insignificant factors based on certain criteria.
  139. 139. Objective of factor analysis • The main objective of factor analysis is to summarize a large number of variables into a smaller number of factors which represent the basic dimensions underlying the data. • Factor analysis is used to uncover the latent structure (dimensions) of a set of variables. • It reduces attribute space from a larger number of variables to a smaller number of factors and as such is a "nondependent" procedure (that is, it does not assume a dependent variable is specified).
  140. 140. Assumptions • Factor analysis is designed for interval data, although it can also be used for ordinal data • The variables used in factor analysis should be linearly related to each other. This can be checked by looking at scatter plots of pairs of variables. • Obviously the variables must also be at least moderately correlated to each other, otherwise the number of factors will be almost the same as the number of original variables, which means that carrying out a factor analysis would be pointless.
  141. 141. Method of determining the appropriateness of factor analysis • If the correlations among the variables are not greater than 0.30, factor analysis is probably inappropriate. • The correlations among variables can also be analyzed by computing the partial correlations among variables. • If partial correlations are high, factor analysis is inappropriate. • Partial correlations should be small, because each variable can be explained by the variables loading on the factors.
  142. 142. • Bartlett test of sphericity: ▫ It is a statistical test for the presence of correlations among the variables. ▫ A statistically significant Bartlett's test of sphericity (sig. < 0.05) indicates that sufficient correlations exist among the variables to proceed. • Measure of sampling adequacy (MSA): o The MSA value must exceed 0.50 for both the overall test and each individual variable. o Variables with values less than 0.50 should be omitted from the factor analysis one at a time, with the smallest one being omitted each time. • The MSA increases as: o The sample size increases o The average correlations increase o The number of variables increases o The number of factors decreases.
  143. 143. Selecting a factor extraction method • Before selecting a method of factor extraction, the researcher must have some understanding of the variance of a variable and how it is divided or partitioned. • For the purposes of factor analysis, it is important to understand how much of a variable's variance is shared with other variables.
  144. 144. • The total variance of any variable can be divided into three types of variance. ▫ Common variance: variance in a variable that is shared with all other variables in the analysis. ▫ Specific variance (unique variance): variance associated with only a specific variable. This variance cannot be explained by the correlations with the other variables. ▫ Error variance: variance that also cannot be explained by correlations with other variables. • As a variable becomes more highly correlated with one or more variables, the common variance (communality) increases. • If unreliable measures or other sources of error variance are introduced, the common variance is reduced.
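As a worked illustration of Bartlett's test, the sketch below computes the test statistic from a correlation matrix using the standard formula chi-square = -[(n - 1) - (2p + 5)/6] · ln|R| with p(p - 1)/2 degrees of freedom, where n is the sample size and p the number of variables. The formula is not given on the slide, and the correlation matrix and sample size are hypothetical. The statistic would then be compared against a chi-square distribution to obtain the significance level.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def bartlett_sphericity(corr, n_obs):
    """Return (chi-square statistic, degrees of freedom) for correlation matrix `corr`."""
    p = len(corr)
    chi_square = -((n_obs - 1) - (2 * p + 5) / 6) * math.log(determinant(corr))
    df = p * (p - 1) // 2
    return chi_square, df

# Hypothetical correlation matrix for three variables, 100 observations.
corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
chi_square, df = bartlett_sphericity(corr, n_obs=100)
print(round(chi_square, 2), df)
```

An identity correlation matrix (no correlations at all) has determinant 1 and gives a statistic of zero, which is the null hypothesis the test is designed to reject.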
  145. 145. Factor analysis Vs principal component analysis • Factor analysis: ▫ It analyzes only the variance shared among the variables (common variance without error or unique variance). ▫ It adjusts the diagonals of the correlation matrix with the unique factors. ▫ The observed variables in FA are linear combinations of the underlying and unique factors. ▫ FA underlying constructs can be labeled and readily interpreted, given an accurate model specification. • Principal component analysis: ▫ It analyzes total variance. ▫ It inserts 1's on the diagonals of the correlation matrix. ▫ The component scores in PCA represent a linear combination of the observed variables weighted by eigenvectors. ▫ PCA components do not represent underlying constructs. • Both models yield similar results if the number of variables exceeds 30 or the communalities exceed 0.60.
  146. 146. Number of factors to extract • Any decision on the number of factors to be retained should be based on several considerations: ▫ Factors with eigenvalues greater than 1.0 ▫ A predetermined number of factors based on research objectives and prior research ▫ Enough factors to meet a specified percentage of variance explained, usually 60% or higher ▫ Factors shown by the scree test to have substantial amounts of common variance
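Two of these retention criteria are easy to apply mechanically. The sketch below uses hypothetical eigenvalues for a six-variable analysis (they are not taken from the slides) and assumes the eigenvalues are sorted in descending order; the function names are made up for the example.

```python
# Hypothetical eigenvalues, sorted from largest to smallest (illustration only).
eigenvalues = [2.8, 1.4, 0.8, 0.5, 0.3, 0.2]

def latent_root(eigenvalues, cutoff=1.0):
    """Number of factors whose eigenvalue exceeds the cutoff."""
    return sum(1 for ev in eigenvalues if ev > cutoff)

def factors_for_variance(eigenvalues, target=0.60):
    """Smallest number of leading factors whose cumulative share of the
    total variance reaches the target proportion."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, ev in enumerate(eigenvalues, start=1):
        cumulative += ev
        if cumulative / total >= target:
            return count
    return len(eigenvalues)

by_eigenvalue = latent_root(eigenvalues)
by_variance = factors_for_variance(eigenvalues, target=0.60)
print(by_eigenvalue, by_variance)
```

The two rules agree here, but they often do not, which is why the slide recommends weighing several considerations rather than relying on a single cutoff.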
  147. 147. Scree Plot [Figure: eigenvalue (vertical axis, 0.0 to 3.0) plotted against component number (horizontal axis, 1 to 6)]
  148. 148. Interpreting the factors • The three processes of factor interpretation: ▫ Estimate the factor matrix: first, the unrotated factor matrix is computed, containing the factor loadings for each variable on each factor. ▫ Factor rotation ▫ Factor interpretation and respecification
  149. 149. Rotating factors • When the factors are extracted, factor loadings are obtained. Factor loadings are the correlations between each variable and the factor. When the factors are rotated, the variance is redistributed so that the factor loading pattern and the percentage of variance for each of the factors are different. • The objective of rotation is to redistribute the variance from earlier factors to later ones to achieve a simpler, theoretically more meaningful factor pattern and make the results easier to interpret. • Two types of factor rotation: 1. Orthogonal factor rotation 2. Oblique factor rotation
  150. 150. Orthogonal factor rotation • In orthogonal rotation the axes are maintained at right angles. • The objective of all methods of rotation is to simplify the rows and columns of the factor matrix. • Simplifying the rows means making as many values in each row as close to zero as possible. • Simplifying the columns means making as many values in each column as close to zero as possible. • There are three rotation methods: 1) Quartimax, 2) Varimax, 3) Equimax.
  151. 151. Orthogonal factor rotation [Figure: variables V1 to V5 plotted on unrotated factors I and II (axes from -1.0 to +1.0), with the orthogonally rotated factor axes I and II overlaid]
  152. 152. 1. The quartimax rotation simplifies the rows of a factor matrix, i.e. it focuses on rotating the initial factors so that a variable loads high on one factor and as low as possible on all other factors. 2. The varimax rotation simplifies the columns of the factor matrix. With this approach, the maximum possible simplification is reached if there are only 1's and 0's in a column. 3. The equimax rotation is a compromise between quartimax and varimax. In practice, the first two are the most commonly applied.
  153. 153. Oblique rotation methods • Oblique rotations are similar to orthogonal rotations. • They allow correlated factors instead of maintaining independence between the rotated factors. • In oblique rotation the axes need not be maintained at right angles. • Oblique rotation represents the clustering of variables more accurately. • There are three rotation methods: ▫ Oblimin ▫ Promax ▫ Orthoblique
  154. 154. Oblique factor rotation [Figure: variables V1 to V5 plotted on unrotated factors I and II (axes from -1.0 to +1.0), with both the orthogonally rotated and the obliquely rotated factor axes I and II overlaid]
  155. 155. Assessing factor analysis • In interpreting factors, a decision must be made regarding which factor loadings are worth consideration and attention. • Loadings exceeding 0.70 are considered indicative of well-defined structure and are the goal of any factor analysis.
  156. 156. Interpreting a factor matrix • Step 1: Examine the factor matrix of loadings. ▫ The factor loading matrix contains the factor loading of each variable on each factor. ▫ Rotated loadings are usually used in factor interpretation unless data reduction is the sole objective. ▫ If an oblique rotation has been used, two matrices of factor loadings are provided: the factor pattern matrix and the factor structure matrix.
  157. 157.  Factor pattern matrix  Represent the unique contribution of each variable to the factor.  Factor structure matrix  Simple correlation between variables and factors, but these loadings contain both the unique variance between variables and factors and the correlation among factors.
  158. 158. Validation of factor analysis Assessing the degree of generalizability of the results to the population and potential influence of individual cases on the overall results. • Use of a confirmatory perspective: ▫ The direct method of validating the results is to use a confirmatory perspective. ▫ Assess the replicability of the results either with a split sample in the original data set or with a separate sample.
  159. 159. Assessing factor structure stability • Factor stability is primarily dependent on the sample size and on the number of cases per variable. • Comparison of the two resulting factor matrices (for example, from a split sample) will provide an assessment of the robustness of the solution.
  160. 160. Detecting influential observations • Another issue in the validation of factor analysis is detecting influential observations. • Estimate the model with and without the observations identified as outliers to assess their impact on the results.
  161. 161. Additional uses of factor analysis results • If the objective is: ▫ To identify logical combinations of variables and better understand the interrelationships among variables, then factor interpretation will be enough. ▫ To identify appropriate variables for subsequent application to other statistical techniques, then some form of data reduction will be used. • There are three options for data reduction: ▫ Summated scale ▫ Surrogate variable ▫ Factor score
  162. 162. Summated Scales • One of the common uses of factor analysis is the formation of summated scales, where we add the scores on all the variables loading on a component to create the score for the component. • To verify that the variables for a component are measuring similar entities that are legitimate to add together, we compute Cronbach's alpha. • If Cronbach's alpha is 0.70 or greater (0.60 or greater for exploratory research), we have support for the internal consistency of the items, justifying their use in a summated scale.
  163. 163. Surrogate variable • This option involves examining the factor matrix and selecting the variable with the highest factor loading on each factor to act as a surrogate variable. • The selection process is more difficult when two or more variables have loadings that are significant and close to each other. • Disadvantages of selecting a single surrogate variable: ▫ It does not address the issue of measurement error. ▫ It also runs the risk of potentially misleading results by selecting only a single variable to represent a more complex result. • The alternative is to calculate a summated scale or factor scores instead of using a surrogate variable.
  164. 164. Factor Score • A factor score is a number that represents each observation's calculated value on each factor in a factor analysis. • At the initial stage, the respondents assign scores to the variables. After factor analysis is performed, each respondent receives a score on each factor. Such scores are called respondent factor scores. • Factor scores are standardized to have a mean of 0 and a standard deviation of 1.
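The reliability check on this slide can be sketched as follows, using the usual formula alpha = k/(k - 1) · (1 - sum of item variances / variance of the summated score), where k is the number of items. The formula is not given on the slide, and the item scores below are hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """`items` holds one list of scores per item, with respondents in the
    same order in every list. Uses sample variances throughout."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # summated scale per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))

# Hypothetical scores of five respondents on three items loading on one component.
items = [
    [2, 4, 3, 5, 1],
    [3, 5, 3, 4, 2],
    [2, 5, 4, 5, 1],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))
```

With these made-up scores alpha is well above the 0.70 threshold, so the three items would be considered consistent enough to add together.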