# Correspondence Analysis


An introductory overview of Correspondence Analysis.

Published in: Education
• Buta... Those are tough questions.
I must admit I have not used Correspondence Analysis (CA) since I made this presentation, so I am not an expert in this discipline. I don't know whether you can use CA scores in a logistic regression. What you can do, however, is take the underlying categorical variable from your CA and include that categorical variable among the independent variables in your logistic regression.
On your second question about multicollinearity, I am unclear how CA would run into a multicollinearity issue, since it typically deals with the 'correspondence' between just two categorical variables.

• Hello,

Thanks for the presentation; it's very useful.
I have a question concerning correspondence analysis and logistic regression.

If the task is to reduce categorical variable data, I should choose correspondence analysis. However, can I use the results (scores) from the correspondence analysis in a logistic regression? And can correspondence analysis cope with the multicollinearity problem?

• PCA is not easy. You probably have to look at different sources, and even then accept that unless you are a professional mathematician you may not understand the whole thing. I don't. But I understand it well enough to interpret its results when I restudy the material. The basic essence of PCA, however, is not so complicated: PCA creates principal components that represent combinations of your independent variables, and it does so in such a fashion that the principal components are perpendicular to each other on a scatter plot. Thus, those principal components are not correlated at all. This is how PCA eliminates multicollinearity between independent variables.

Let's say you attempt to model auto sales nationwide, and you have 15 independent variables of a macroeconomic, demographic, or auto-industry nature. Many of them are correlated, so a multivariate model would suffer from multicollinearity. PCA essentially reduces those 15 variables into three principal components that are orthogonal to each other (zero correlation).
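The idea can be sketched with synthetic data; the series below are hypothetical stand-ins for the correlated macroeconomic variables in the example, not real data. The point is that the component scores PCA produces are exactly uncorrelated.

```python
import numpy as np

# A minimal sketch, with made-up data: two correlated "economic" series
# plus an unrelated one, reduced to orthogonal principal components.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.1 * rng.normal(size=200),   # hypothetical: GDP growth
    base + 0.1 * rng.normal(size=200),   # hypothetical: income growth (correlated)
    rng.normal(size=200),                # hypothetical: an unrelated series
])

Xc = X - X.mean(axis=0)                  # center each column
# The principal components come from the SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                       # component scores for each observation

# The scores are uncorrelated: their covariance matrix is diagonal.
cov = np.cov(scores, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.max(np.abs(off_diag)))          # ~0 up to floating point
```

Because the scores equal U times the diagonal of singular values, their covariance is diagonal by construction, which is the "zero correlation" property mentioned above.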

In terms of the precise calculation of the coordinates for each of those independent variables, I refer you to this book:

http://www.amazon.com/Principal-Components-Analysis-Quantitative-Applications/dp/0803931042/ref=sr_1_3?ie=UTF8&s=books&qid=1271956552&sr=1-3

which I actually studied and reviewed at one point. The Wikipedia entry also seemed pretty detailed and informative.

• Yes Guy, now you're exactly on the point. I need to understand the coordinates derived using PCA. I know their applications in further analysis, but I fail to get a basic understanding of how the coordinates are obtained. I have searched many pages on the internet, but in vain. I have attended Strang's lessons on linear algebra, but I get lost towards the end, so I am unable to get it. Kindly help me out by explaining how to get coordinates using PCA.

• Part of your question is simple and part is tricky. You can see on slide 9 that the Eigenvalue for dimension F1 equals the sum of Row Mass x Coordinate^2. So the first row, for the 16-24 year olds, gives 15.3% times 0.718^2 = 0.079. You do that for all the age groups, sum them up, and you get the 0.095 Eigenvalue for dimension F1.
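The arithmetic in this reply can be checked directly, using the slide-9 figures it quotes (mass 15.3%, F1 coordinate 0.718):

```python
# Row contribution to the F1 eigenvalue = Row Mass x Coordinate^2,
# with the 16-24 values quoted above.
mass = 0.153
coordinate = 0.718
contribution = mass * coordinate ** 2
print(round(contribution, 3))  # 0.079
```

Summing this quantity over all seven age rows gives the 0.095 eigenvalue mentioned above.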

To fully answer your question, we next need to explain how PCA calculates the coordinates of F1 (the first principal component). This is pretty challenging. It involves rotating the original axes so that the first principal component (F1) captures as much of the variance in the data as possible, with each subsequent component capturing as much of the remaining variance as possible while staying perpendicular to the earlier ones. By doing so, PCA is an excellent method to deal with and eliminate multicollinearity between independent variables.
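For what it's worth, the coordinate calculation can be written down compactly. The sketch below runs correspondence analysis on the moviegoer table from slide 7 using the singular value decomposition of the standardized residuals; this is the generic CA recipe (not necessarily XLStat's exact code), and it reproduces the slide-9 eigenvalue and the 16-24 coordinate.

```python
import numpy as np

# The contingency table from slide 7 (rows: age groups 16-24 ... 75+;
# columns: Bad, Average, Good, Very Good).
N = np.array([
    [ 69, 49, 48, 41],   # 16-24
    [148, 45, 14, 22],   # 25-34
    [170, 65, 12, 29],   # 35-44
    [159, 57, 12, 28],   # 45-54
    [122, 26,  6, 18],   # 55-64
    [106, 21,  5, 23],   # 65-74
    [ 40,  7,  1, 14],   # 75+
], dtype=float)

n = N.sum()
P = N / n                      # correspondence matrix
r = P.sum(axis=1)              # row masses
c = P.sum(axis=0)              # column masses

# Standardized residuals: (P - r c') scaled by the square roots of the masses.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

eigenvalues = sv ** 2                          # one per dimension (F1, F2, ...)
row_coords = (U * sv) / np.sqrt(r)[:, None]    # principal row coordinates

print(round(eigenvalues[0], 3))                # F1 eigenvalue, ~0.095
print(round(abs(row_coords[0, 0]), 3))         # 16-24 coordinate on F1, ~0.718
print(round(eigenvalues.sum(), 3))             # total inertia = Chi Square/n, ~0.109
```

The signs of the coordinates are arbitrary (an SVD quirk), which is why the absolute value is printed; the sum of the squared singular values equals the total inertia, Chi Square divided by the sample size.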

PCA is not something one (me, anyway) can clearly explain in a couple of paragraphs. For further understanding, I recommend you study materials on the subject at Wikipedia, Slideshare.net, Google Knols, and other similar places. With much studying, grasping the basics of PCA is not that difficult. But, given its counterintuitive nature (the principal components are often unexplainable combinations of the X variables), it is not often used outside fairly intensive quantitative circles. That is unfortunate, because PCA is the key engine for a lot of good stuff, including Correspondence Analysis, Factor Analysis, and Discriminant Analysis, and it is also valuable as a stand-alone method.

You may have heard of the Michael Mann hockey stick controversy (global warming). It was essentially a PCA application. Steve McIntyre, a mathematician, uncovered that Michael Mann had overweighted certain tree-ring data, used as a proxy for temperature changes, to generate principal components that in turn produced the hockey-stick uptick in temperature during the second half of the 20th century. When McIntyre corrected this overweighting, the long-term trend in temperature returned to random noise and the hockey-stick pattern disappeared. That's a dramatic example suggesting that understanding PCA is a critical part of modern critical thinking.


### Correspondence Analysis

1. Correspondence Analysis with XLStat. Guy Lion, Financial Modeling, April 2005
2. Statistical Methods Classification
3. The Solar (PCA) System
4. Capabilities
   - Correspondence Analysis (CA) handles a Categorical Independent variable with a Categorical Dependent variable.
   - CA analyzes the association between two categorical variables by representing the categories as points in a 2- or 3-dimensional graph.
5. 4 Steps
   - Testing for Independence of both Variables (in XLStat only).
   - Compute Category Profiles (relative frequencies) and Masses (marginal proportions) for both Points-rows and Points-columns.
   - Calculate the distance (Chi Square distance) between Points-rows, and then between Points-columns.
   - Develop the best-fitting space of n dimensions (relying on PCA).
6. An Example: Moviegoers. You classify by Age bucket the opinions of 1,357 viewers of a movie.
7. Testing Independence: Chi Square. One cell (16-24/Good) accounts for 49.3% (73.1/148.3) of the Chi Square value for all 28 cells.

   Observed counts:

   | Age | Bad | Average | Good | Very Good | Total |
   |---|---|---|---|---|---|
   | 16-24 | 69 | 49 | 48 | 41 | 207 |
   | 25-34 | 148 | 45 | 14 | 22 | 229 |
   | 35-44 | 170 | 65 | 12 | 29 | 276 |
   | 45-54 | 159 | 57 | 12 | 28 | 256 |
   | 55-64 | 122 | 26 | 6 | 18 | 172 |
   | 65-74 | 106 | 21 | 5 | 23 | 155 |
   | 75+ | 40 | 7 | 1 | 14 | 62 |
   | Total | 814 | 270 | 98 | 175 | 1357 |
   | Share | 60% | 20% | 7% | 13% | 100% |

   Expected counts (Row Total x Column Total / 1357):

   | Age | Bad | Average | Good | Very Good | Total |
   |---|---|---|---|---|---|
   | 16-24 | 124.2 | 41.2 | 14.9 | 26.7 | 207 |
   | 25-34 | 137.4 | 45.6 | 16.5 | 29.5 | 229 |
   | 35-44 | 165.6 | 54.9 | 19.9 | 35.6 | 276 |
   | 45-54 | 153.6 | 50.9 | 18.5 | 33.0 | 256 |
   | 55-64 | 103.2 | 34.2 | 12.4 | 22.2 | 172 |
   | 65-74 | 93.0 | 30.8 | 11.2 | 20.0 | 155 |
   | 75+ | 37.2 | 12.3 | 4.5 | 8.0 | 62 |
   | Total | 814 | 270 | 98 | 175 | 1357 |

   Chi Square calculations, (Observed - Expected)^2/Expected per cell; e.g. 16-24/Good: (48 - 14.9)^2/14.9 = 73.1:

   | Age | Bad | Average | Good | Very Good | Total |
   |---|---|---|---|---|---|
   | 16-24 | 24.5 | 1.5 | 73.1 | 7.7 | 106.7 |
   | 25-34 | 0.8 | 0.0 | 0.4 | 1.9 | 3.1 |
   | 35-44 | 0.1 | 1.9 | 3.2 | 1.2 | 6.3 |
   | 45-54 | 0.2 | 0.7 | 2.3 | 0.8 | 4.0 |
   | 55-64 | 3.4 | 2.0 | 3.3 | 0.8 | 9.5 |
   | 65-74 | 1.8 | 3.1 | 3.4 | 0.5 | 8.8 |
   | 75+ | 0.2 | 2.3 | 2.7 | 4.5 | 9.7 |
   | Total | 31.1 | 11.5 | 88.3 | 17.3 | 148.3 |

   Chi Square = 148.3; DF = 18 = (7 - 1)(4 - 1); p value = 1.613E-22.
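Slide 7's figures can be reproduced with a few lines of numpy (the observed counts are typed in from the slide):

```python
import numpy as np

# Observed counts from slide 7 (rows: age groups 16-24 ... 75+;
# columns: Bad, Average, Good, Very Good).
observed = np.array([
    [ 69, 49, 48, 41],
    [148, 45, 14, 22],
    [170, 65, 12, 29],
    [159, 57, 12, 28],
    [122, 26,  6, 18],
    [106, 21,  5, 23],
    [ 40,  7,  1, 14],
], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)     # 207, 229, ...
col_totals = observed.sum(axis=0, keepdims=True)     # 814, 270, 98, 175
expected = row_totals @ col_totals / observed.sum()  # Row x Column / 1357

cells = (observed - expected) ** 2 / expected        # per-cell contributions
chi_square = cells.sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(round(chi_square, 1))   # 148.3
print(dof)                    # 18
print(round(cells[0, 2], 1))  # 16-24/Good cell, 73.1
```

The 16-24/Good cell alone contributes 73.1 of the 148.3 total, which is the 49.3% concentration the slide highlights.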
8. Row Mass & Profile
9. Eigenvalues of Dimensions. The Dimension F1 Eigenvalue of 0.095 explains 86.6% (0.095/0.109) of the Inertia or Variance. The F1 Coordinates are derived using PCA.
10. Singular Value. Singular value = SQRT(Eigenvalue). It is the maximum Canonical Correlation between the categories of the variables for any given dimension.
11. Calculating Chi Square Distance for Points-rows. The Chi Square Distance defines the distance between a Point-row and the Centroid (average) at the intersection of the F1 and F2 dimensions. The Point-row 16-24 is the most distant from the Centroid (0.72).
12. Calculating Inertia [or Variance] using Points-rows. XLStat calculates this table. It shows which Row category generates the most Inertia (Row 16-24 accounts for 72% of it).
13. 2 Other Ways to Calculate Inertia
   - Inertia = Chi Square / Sample size: 148.27/1,357 = 0.109.
   - Inertia = Sum of the Dimensions' Eigenvalues. This shows how much each Dimension explains of the overall Inertia.
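The first identity on slide 13, checked with the totals from slide 7:

```python
# Inertia = Chi Square / Sample size (slide 13, using slide 7's totals).
chi_square = 148.27
sample_size = 1357
inertia = chi_square / sample_size
print(round(inertia, 3))  # 0.109
```

The second identity says the same 0.109 is recovered by summing the eigenvalues of all the dimensions (0.095 for F1 plus the smaller dimensions).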
14. Contribution of Points-rows to Dimension F1. The contribution of a point to a dimension is the proportion of the Dimension's Inertia explained by the Point. The contributions of Points-rows to dimensions help us interpret the dimensions. The sum of contributions for each dimension equals 100%.
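Slide 14's definition can be illustrated with the 16-24 row, using the slide-9 figures (mass 15.3%, F1 coordinate 0.718, F1 eigenvalue 0.095); the resulting share is derived here, not printed on the slide.

```python
# Contribution of a Point-row to dimension F1:
# (Row Mass x Coordinate^2) / Eigenvalue, with slide-9 values for 16-24.
mass, coordinate, eigenvalue = 0.153, 0.718, 0.095
contribution = mass * coordinate ** 2 / eigenvalue
print(round(contribution, 2))  # 0.83
```

So the 16-24 row alone accounts for roughly 83% of dimension F1, which is why F1 is naturally interpreted as a "young vs everyone else" axis.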
15. Contribution of Dimensions to Points-rows: Squared Correlation
   - The Contribution of a Dimension to a Point-row tells how much of a Point's Inertia is explained by the dimension.
   - These Contributions are called Squared Correlations, since they equal COS^2 of the angle between the principal axis and the line from the Centroid to the point.
16. Squared Correlation = COS^2. If the Contribution is high, the angle between the point vector and the axis is small.
17. Quality. Quality = the sum of the Squared Correlations for the dimensions shown (normally F1 and F2). Quality is different for each Point-row (or Point-column). Quality indicates how accurately the Point is represented on a two-dimensional graph, and is interpreted as the proportion of Chi Square accounted for with the given number of dimensions. A low Quality means the current number of dimensions does not represent the respective row (or column) well.
18. Plot of Points-rows
19. Review of Calculation Flows
20. Column Profile & Mass
21. Calculating Chi Square Distance for Points-columns. Distance = SQRT(Sum((Column Profile - Avg. Column Profile)^2 / Avg. Column Profile)).
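Slide 21's formula, applied to the Good column of the slide-7 table. Note that the formula yields a distance of about 0.95; its square, about 0.90, is the figure that matches the value quoted for Good in the conclusion.

```python
import numpy as np

# Chi Square distance of the "Good" Point-column from the Centroid,
# per slide 21, with the slide-7 counts.
good = np.array([48, 14, 12, 12, 6, 5, 1], dtype=float)           # Good column
row_totals = np.array([207, 229, 276, 256, 172, 155, 62], dtype=float)
n = 1357

profile = good / good.sum()          # column profile
avg_profile = row_totals / n         # average column profile = row masses
dist = np.sqrt((((profile - avg_profile) ** 2) / avg_profile).sum())

print(round(dist, 2))        # ~0.95
print(round(dist ** 2, 2))   # ~0.90 (squared distance)
```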
22. Contribution of Points-columns to Dimension F1. Contribution = (Column Mass)(Coordinate^2)/Eigenvalue.
23. Contribution of Dimension F1 to Points-columns
24. Plot of Points-columns
25. Plot of All Points
26. Observing the Correspondences
27. Conclusion
   - The Age categories and Opinion categories are dependent (overall Chi Square p value 0.00%).
   - The most different Point-row is 16-24: 0.72 Chi Square distance from the Centroid; it accounts for 72.0% of Inertia.
   - The most different Point-column is "Good": 0.95 Chi Square distance from the Centroid (squared distance 0.90); it accounts for 59.6% of Inertia.
28. Conclusion (continued). Remember that we can't directly compare Distances across categories (Rows vs Columns). The 16-24 Point-row makes a greater contribution to Inertia and to the overall Chi Square than the Good Point-column because it has a greater mass (207 occurrences vs only 98 for Good).