Correspondence Analysis with XLStat  Guy Lion Financial Modeling April 2005
Statistical Methods Classification
The Solar (PCA) System
Capabilities Correspondence Analysis (CA) handles a Categorical Independent variable with a Categorical Dependent variable. CA analyzes the association between two categorical variables by representing their categories as points in a two- or three-dimensional graph.
4 Steps
1. Test for Independence of both Variables (in XLStat only).
2. Compute Category Profiles (relative frequencies) and Masses (marginal proportions) for both Points-rows and Points-columns.
3. Calculate the distance (Chi Square distance) between Points-rows, and then between Points-columns.
4. Develop the best-fitting space of n dimensions (relying on PCA).
An Example: Moviegoers The opinions of 1,357 viewers of a movie are classified by Age bucket.
Testing Independence: Chi Square
One cell (16-24/Good) accounts for 49.3% (73.1/148.3) of the Chi Square value for all 28 cells.

Observed
Age      Bad    Average  Good   Very Good  Total
16-24     69       49     48       41       207
25-34    148       45     14       22       229
35-44    170       65     12       29       276
45-54    159       57     12       28       256
55-64    122       26      6       18       172
65-74    106       21      5       23       155
75+       40        7      1       14        62
Total    814      270     98      175      1357
Share    60%      20%     7%      13%      100%

Expected
Age      Bad    Average  Good   Very Good  Total
16-24    124.2    41.2    14.9     26.7     207
25-34    137.4    45.6    16.5     29.5     229
35-44    165.6    54.9    19.9     35.6     276
45-54    153.6    50.9    18.5     33.0     256
55-64    103.2    34.2    12.4     22.2     172
65-74     93.0    30.8    11.2     20.0     155
75+       37.2    12.3     4.5      8.0      62
Total    814      270      98      175     1357
Share    60%      20%      7%      13%     100%

Chi Square Calculations: (Observed - Expected)² / Expected
Example for 16-24/Good: (48 - 14.9)² / 14.9 = 73.1
Age      Bad    Average  Good   Very Good  Total
16-24    24.5     1.5     73.1     7.7     106.7
25-34     0.8     0.0      0.4     1.9       3.1
35-44     0.1     1.9      3.2     1.2       6.3
45-54     0.2     0.7      2.3     0.8       4.0
55-64     3.4     2.0      3.3     0.8       9.5
65-74     1.8     3.1      3.4     0.5       8.8
75+       0.2     2.3      2.7     4.5       9.7
Total    31.1    11.5     88.3    17.3     148.3

Chi Square: 148.3
DF: 18 = (7 - 1)(4 - 1)
p value: 1.613E-22
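The independence test above can be reproduced outside XLStat. The following is a minimal sketch in plain Python, not XLStat's own procedure: the variable names and the helper `chi2_sf_even_df` are illustrative, and the p value uses a closed form of the Chi Square upper tail that is valid only when the degrees of freedom are even (18 here).

```python
import math

ages = ["16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]
opinions = ["Bad", "Average", "Good", "Very Good"]
observed = [
    [69, 49, 48, 41],
    [148, 45, 14, 22],
    [170, 65, 12, 29],
    [159, 57, 12, 28],
    [122, 26, 6, 18],
    [106, 21, 5, 23],
    [40, 7, 1, 14],
]

n = sum(map(sum, observed))
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Expected count under independence: row total * column total / grand total.
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Cell contributions: (Observed - Expected)^2 / Expected.
cells = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi_square = sum(map(sum, cells))
df = (len(ages) - 1) * (len(opinions) - 1)


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a Chi Square variable (even df only)."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires an even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(chi_square, df)
good_16_24 = cells[0][2]

print(f"Chi Square = {chi_square:.2f}, DF = {df}, p value = {p_value:.3e}")
print(f"16-24/Good share of Chi Square = {good_16_24 / chi_square:.1%}")
```

With a statistics library available, a general-purpose Chi Square survival function would replace the hand-written helper.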
Row Mass & Profile
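As a hedged illustration of this slide, the sketch below computes the row masses (each row total divided by the grand total) and the row profiles (each row's counts divided by its own total) from the observed counts. The names `row_mass`, `row_profile`, and `avg_row_profile` are illustrative and not XLStat terms.

```python
ages = ["16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]
opinions = ["Bad", "Average", "Good", "Very Good"]
observed = [
    [69, 49, 48, 41],
    [148, 45, 14, 22],
    [170, 65, 12, 29],
    [159, 57, 12, 28],
    [122, 26, 6, 18],
    [106, 21, 5, 23],
    [40, 7, 1, 14],
]

n = sum(map(sum, observed))

# Mass of a row: its share of all respondents.
row_mass = {age: sum(row) / n for age, row in zip(ages, observed)}

# Profile of a row: the distribution of opinions within that age bucket.
row_profile = {
    age: [count / sum(row) for count in row] for age, row in zip(ages, observed)
}

# Average row profile (the Centroid): the column totals as shares of n.
avg_row_profile = [sum(col) / n for col in zip(*observed)]

for age in ages:
    profile = "  ".join(f"{p:6.1%}" for p in row_profile[age])
    print(f"{age:>6}  mass {row_mass[age]:6.1%}  profile {profile}")
print("Centroid:", "  ".join(f"{p:6.1%}" for p in avg_row_profile))
```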
Eigenvalues of Dimensions Dimension F1, with an Eigenvalue of 0.095, explains 86.6% (0.095/0.109) of the Inertia or Variance. F1 Coordinates are derived using PCA.
Singular Value Singular value = SQRT(Eigenvalue).  It is the maximum Canonical Correlation between the categories of the variables in analysis for any given dimension.
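The eigenvalues and singular values on the last two slides can be approximated with a singular value decomposition of the standardized residuals, which is a common way to compute CA. This is a sketch under that assumption, using NumPy. It is not a description of XLStat's internals, and all variable names are illustrative.

```python
import numpy as np

counts = np.array(
    [
        [69, 49, 48, 41],
        [148, 45, 14, 22],
        [170, 65, 12, 29],
        [159, 57, 12, 28],
        [122, 26, 6, 18],
        [106, 21, 5, 23],
        [40, 7, 1, 14],
    ],
    dtype=float,
)

P = counts / counts.sum()          # correspondence matrix (relative frequencies)
row_mass = P.sum(axis=1)
col_mass = P.sum(axis=0)
independence = np.outer(row_mass, col_mass)

# Standardized residuals: departure from independence, scaled by the masses.
residuals = (P - independence) / np.sqrt(independence)

# A 7 x 4 table has at most min(7, 4) - 1 = 3 non-trivial dimensions.
n_dims = min(counts.shape) - 1
singular_values = np.linalg.svd(residuals, compute_uv=False)[:n_dims]
eigenvalues = singular_values ** 2
total_inertia = eigenvalues.sum()
share = eigenvalues / total_inertia

for k in range(n_dims):
    print(
        f"F{k + 1}: singular value {singular_values[k]:.3f}, "
        f"eigenvalue {eigenvalues[k]:.3f}, share of Inertia {share[k]:.1%}"
    )
print(f"Total Inertia: {total_inertia:.3f}")
```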
Calculating Chi Square Distance for Points-rows Chi Square Distance defines the distance between a Point-row and the Centroid (Average) at the intersection of the F1 and F2 dimensions.  The Point-row 16-24 is most distant from Centroid (0.72).
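The row distances can be checked by hand. The sketch below, in plain Python with illustrative names, computes for each row the square root of the sum of (Row Profile - Avg. Row Profile)² / Avg. Row Profile.

```python
import math

ages = ["16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]
observed = [
    [69, 49, 48, 41],
    [148, 45, 14, 22],
    [170, 65, 12, 29],
    [159, 57, 12, 28],
    [122, 26, 6, 18],
    [106, 21, 5, 23],
    [40, 7, 1, 14],
]

n = sum(map(sum, observed))
avg_row_profile = [sum(col) / n for col in zip(*observed)]  # the Centroid

row_distance = {}
for age, row in zip(ages, observed):
    total = sum(row)
    profile = [count / total for count in row]
    # Chi Square distance: squared deviations weighted by 1 / average profile.
    squared = sum((p - a) ** 2 / a for p, a in zip(profile, avg_row_profile))
    row_distance[age] = math.sqrt(squared)

for age, dist in row_distance.items():
    print(f"{age:>6}: {dist:.2f}")

most_distant = max(row_distance, key=row_distance.get)
```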
Calculating Inertia [or Variance] using Points-rows XLStat calculates this table. It shows which Row category generates the most Inertia (Row 16-24 accounts for 72% of it).
2 other ways to calculate Inertia (1) Inertia = Chi Square/Sample size: 148.27/1,357 = 0.109. (2) Inertia = Sum of the Dimensions' Eigenvalues, which also shows how much of the overall Inertia each Dimension explains.
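The row-based calculation and the Chi Square shortcut can be compared directly. This plain-Python sketch (illustrative names) takes each row's Inertia as its mass times its squared Chi Square distance from the Centroid, then checks that the row Inertias add up to Chi Square / Sample size.

```python
ages = ["16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]
observed = [
    [69, 49, 48, 41],
    [148, 45, 14, 22],
    [170, 65, 12, 29],
    [159, 57, 12, 28],
    [122, 26, 6, 18],
    [106, 21, 5, 23],
    [40, 7, 1, 14],
]

n = sum(map(sum, observed))
col_totals = [sum(col) for col in zip(*observed)]
avg_row_profile = [ct / n for ct in col_totals]

row_inertia = {}
for age, row in zip(ages, observed):
    total = sum(row)
    mass = total / n
    profile = [count / total for count in row]
    squared_distance = sum(
        (p - a) ** 2 / a for p, a in zip(profile, avg_row_profile)
    )
    row_inertia[age] = mass * squared_distance

total_inertia = sum(row_inertia.values())
row_share = {age: value / total_inertia for age, value in row_inertia.items()}

# Second route: Inertia = Chi Square / Sample size.
chi_square = sum(
    (count - sum(row) * ct / n) ** 2 / (sum(row) * ct / n)
    for row in observed
    for count, ct in zip(row, col_totals)
)

print(f"Sum of row Inertias:      {total_inertia:.3f}")
print(f"Chi Square / Sample size: {chi_square / n:.3f}")
print(f"Share of Row 16-24:       {row_share['16-24']:.1%}")
```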
Contribution of Points-rows to Dimension F1 The contribution of a point to a dimension is the proportion of that Dimension's Inertia explained by the Point. The contributions of Points-rows to dimensions help us interpret the dimensions. The sum of contributions for each dimension equals 100%.
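A sketch of these contributions, assuming the row coordinates come from a singular value decomposition of the standardized residuals (a common CA computation, not necessarily XLStat's exact routine). Each contribution is Row Mass × Coordinate² / Eigenvalue. Names are illustrative.

```python
import numpy as np

ages = ["16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]
counts = np.array(
    [
        [69, 49, 48, 41],
        [148, 45, 14, 22],
        [170, 65, 12, 29],
        [159, 57, 12, 28],
        [122, 26, 6, 18],
        [106, 21, 5, 23],
        [40, 7, 1, 14],
    ],
    dtype=float,
)

P = counts / counts.sum()
row_mass = P.sum(axis=1)
col_mass = P.sum(axis=0)
independence = np.outer(row_mass, col_mass)
residuals = (P - independence) / np.sqrt(independence)

# Keep only the min(rows, columns) - 1 non-trivial dimensions.
n_dims = min(counts.shape) - 1
U, s, _ = np.linalg.svd(residuals, full_matrices=False)
U, s = U[:, :n_dims], s[:n_dims]
eigenvalues = s ** 2

# Row principal coordinates on F1, F2, F3.
row_coords = (U * s) / np.sqrt(row_mass)[:, None]

# Contribution of each row to each dimension: mass * coordinate^2 / eigenvalue.
contrib = row_mass[:, None] * row_coords ** 2 / eigenvalues
contrib_f1 = dict(zip(ages, contrib[:, 0]))

for age, value in contrib_f1.items():
    print(f"{age:>6}: {value:.1%} of F1")
```

The sign of each axis returned by the decomposition is arbitrary, which does not matter here because only squared coordinates are used.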
Contribution of Dimension to Points-rows: Squared Correlation The Contribution of Dimensions to Points-rows tells how much of a Point's Inertia is explained by a dimension. These Contributions are called Squared Correlations since they equal COS² of the angle between the line from the Centroid to the point and the principal axis.
Squared Correlation = COS² If the Contribution is high, the angle between the point vector and the axis is small.
Quality Quality = Sum of the Squared Correlations for the dimensions shown (normally F1 and F2). Quality differs for each Point-row (or Point-column). It indicates how accurately the Point is represented on a two-dimensional graph, and is interpreted as the proportion of the Point's Chi Square accounted for by the given number of dimensions. A low Quality means the current number of dimensions does not represent the respective row (or column) well.
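Squared Correlations and Quality can be sketched together. The code below assumes row coordinates obtained from a singular value decomposition of the standardized residuals (a standard CA computation, used here as an assumption rather than as XLStat's documented method). COS² for a row on a dimension is its squared coordinate divided by its squared distance from the Centroid, and Quality sums COS² over F1 and F2. Names are illustrative.

```python
import numpy as np

ages = ["16-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]
counts = np.array(
    [
        [69, 49, 48, 41],
        [148, 45, 14, 22],
        [170, 65, 12, 29],
        [159, 57, 12, 28],
        [122, 26, 6, 18],
        [106, 21, 5, 23],
        [40, 7, 1, 14],
    ],
    dtype=float,
)

P = counts / counts.sum()
row_mass = P.sum(axis=1)
col_mass = P.sum(axis=0)
independence = np.outer(row_mass, col_mass)
residuals = (P - independence) / np.sqrt(independence)

n_dims = min(counts.shape) - 1
U, s, _ = np.linalg.svd(residuals, full_matrices=False)
row_coords = (U[:, :n_dims] * s[:n_dims]) / np.sqrt(row_mass)[:, None]

# Squared Chi Square distance to the Centroid = sum of squared coordinates
# over all non-trivial dimensions.
squared_distance = (row_coords ** 2).sum(axis=1)

# COS^2: share of each row's squared distance carried by each dimension.
cos2 = row_coords ** 2 / squared_distance[:, None]

# Quality on the usual two-dimensional map (F1 and F2).
quality = cos2[:, :2].sum(axis=1)

for age, c, q in zip(ages, cos2, quality):
    print(f"{age:>6}: COS2 F1 {c[0]:.3f}, COS2 F2 {c[1]:.3f}, Quality {q:.3f}")
```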
Plot of Points-Rows
Review of Calculation Flows
Column Profile & Mass
Calculating Chi Square Distance for Points-column Distance = SQRT(Sum((Column Profile - Avg. Column Profile)² / Avg. Column Profile))
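The column formula can be applied to the observed counts as follows. This is a plain-Python sketch with illustrative names. On these counts, the squared distance for Good comes to about 0.90 and the distance itself (its square root) to about 0.95.

```python
import math

opinions = ["Bad", "Average", "Good", "Very Good"]
observed = [
    [69, 49, 48, 41],
    [148, 45, 14, 22],
    [170, 65, 12, 29],
    [159, 57, 12, 28],
    [122, 26, 6, 18],
    [106, 21, 5, 23],
    [40, 7, 1, 14],
]

n = sum(map(sum, observed))
# Average column profile (the Centroid of the columns): the row masses.
avg_col_profile = [sum(row) / n for row in observed]

col_squared_distance = {}
col_distance = {}
for opinion, column in zip(opinions, zip(*observed)):
    total = sum(column)
    profile = [count / total for count in column]
    squared = sum((p - a) ** 2 / a for p, a in zip(profile, avg_col_profile))
    col_squared_distance[opinion] = squared
    col_distance[opinion] = math.sqrt(squared)

for opinion in opinions:
    print(
        f"{opinion:>9}: squared distance {col_squared_distance[opinion]:.2f}, "
        f"distance {col_distance[opinion]:.2f}"
    )

most_distant = max(col_distance, key=col_distance.get)
```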
Contribution of Points-column to Dimension F1 Contribution = (Col. Mass)(Coordinate²)/Eigenvalue
Contribution of Dimension F1 to Points-columns
Plot of Points-Columns
Plot of all Points
Observing the Correspondences
Conclusion The Age Categories and Opinion Categories are dependent: the overall Chi Square p value is effectively 0.00% (1.613E-22). The most different Point-row is 16-24: it lies at a Chi Square distance of 0.72 from the Centroid and accounts for 72.0% of Inertia. The most different Point-column is “Good”: its squared Chi Square distance from the Centroid is 0.90 (a distance of about 0.95), and it accounts for 59.6% of Inertia.
Conclusion (continued) Distances cannot be compared directly across categories (Row vs Column). The 16-24 Point-row contributes more to Inertia and to the overall Chi Square than the Good Point-column does. This is because a point's Inertia is its mass times its squared distance from the Centroid, and the 16-24 Point-row has a greater mass (207 occurrences vs only 98 for Good).
