Cross Tabulation Andrew Martin PS 372 University of Kentucky
What is a cross-tabulation? A cross-tabulation displays the joint frequencies and relative frequencies of two categorical (nominal or ordinal) variables. The distribution is listed for each combination of categories that exists between the two variables. Each case is then placed in the cell of the table that represents the combination of values that corresponds to its scores on the variables.
What is a cross-tabulation? Ex: Party identification and gender (NES data). Assuming party identification has three categories (Democrat, Independent, Republican) and gender has two (male and female), the table would have 6 cells.

Party ID   Male     Female
Dem.       Cell 1   Cell 2
Ind.       Cell 3   Cell 4
Rep.       Cell 5   Cell 6
What is a cross-tabulation? If we use the seven-point party identification scale, the cross-tabulation gets bigger: 14 cells instead of 6.

Party ID            Male      Female
Strong Democrat     Cell 1    Cell 2
Weak Democrat       Cell 3    Cell 4
Ind.-Democrat       Cell 5    Cell 6
Independent         Cell 7    Cell 8
Ind.-Republican     Cell 9    Cell 10
Weak Republican     Cell 11   Cell 12
Strong Republican   Cell 13   Cell 14
How to construct a cross-tab
(1) Separate the cases into groups based on their values for the independent variable.
(2) For each group on the independent variable, compute the frequencies or percentages falling in each level of the dependent variable.
(3) Decide whether the frequency or percentage distributions differ from group to group, and if so, by how much.
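The three steps can be sketched in a few lines of Python. This is only an illustration: the eight cases below are invented, not NES data, and the helper name `column_percentages` is hypothetical. The independent variable (gender) defines the columns, so each column's percentages sum to 100.

```python
from collections import Counter

# Hypothetical (party, gender) cases, invented purely for illustration.
cases = [
    ("Dem.", "Female"), ("Dem.", "Female"), ("Dem.", "Male"),
    ("Ind.", "Female"), ("Ind.", "Male"),
    ("Rep.", "Female"), ("Rep.", "Male"), ("Rep.", "Male"),
]

def column_percentages(cases):
    """Return {column: {row: percent}} with the independent variable in the columns."""
    cell_counts = Counter(cases)                   # step 2: joint frequencies per cell
    col_totals = Counter(col for _, col in cases)  # step 1: group sizes on the independent variable
    table = {}
    for (row, col), n in cell_counts.items():
        table.setdefault(col, {})[row] = 100.0 * n / col_totals[col]
    return table

table = column_percentages(cases)

# Step 3: compare the distributions across groups.
for col, dist in table.items():
    print(col, {row: round(pct, 1) for row, pct in dist.items()})
```

In this made-up sample, half of the women but only a quarter of the men are Democrats, which is the kind of group-to-group difference step 3 asks you to look for.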
Remember, with cross-tabs 1. When the independent variable defines the columns, the column percentages are more important than the row percentages. 2. The column percentages should add to 100.
Research Questions We can use cross-tabs to investigate the following research questions: Is there a relationship between gender and partisanship? Are women more likely to be Democrats than men? If so, does this mean women are more liberal than men?
Cross-tabs and two-variable relationships We assume the relationship is such that: Gender --> Partisanship In other words, gender is the independent variable that explains variation in partisanship, which is the dependent variable.
Strength of Relationship Refers to how different the observed values of the dependent variable are across the categories of the independent variable. If all the cases in each category of the independent variable fell into a single, distinct category of the dependent variable, there would be a perfect relationship. This almost never occurs. If the dependent variable has the same distribution in every category of the independent variable, there is no relationship.
Direction of the relationship The direction of the relationship shows which values of the independent variable are associated with which values of the dependent variable. If higher values of the independent variable are associated with higher values of the dependent variable, the relationship is positive. If lower values of the independent variable are associated with higher values of the dependent variable, the relationship is negative.
Another research question Suppose you were asked to predict how Americans would respond to a question about making gun control laws more stringent. In the absence of any information about their attitudes toward gun control, what would you use to predict attitudes about pending gun control legislation? Potential answers: Ideology, partisanship, related survey questions about gun rights and restrictions.
Cross-tab limitations Sometimes it is practicable to examine the relationships of two variables by just looking at the tables. In some instances analysis involves many tables or tables with many cells. In those instances, it may be more useful to summarize the information using coefficients for ordinal data.
Calculating Coefficients for Ordinal Data We're not going to learn how to calculate each statistic, but there are some basic concepts we should review. In particular, we need to know how to identify concordant pairs, discordant pairs and tied pairs.
Concordant, Discordant and Tied Pairs In a concordant pair , one case is higher than another case for both variables. In a discordant pair , one case is lower on one of the variables but higher on the other. In a tied pair both cases have the same value on one or both variables.
Values by Name and Variable

Name      Variable X   Variable Y
Alex      3            3
Carl      3            1
Dawn      2            3
Ernesto   2            2
Fay       2            1
Gus       1            3
Hera      1            2
Ike       1            1
Jasmine   1            1
Determining Pair Type Alex and Ike Variable X = 3 (Alex) – 1 (Ike) = 2 Variable Y = 3 (Alex) – 1 (Ike) = 2 Both numbers +; Alex and Ike are concordant
Determining Pair Type Carl and Ernesto Variable X = 3 (Carl) – 2 (Ernesto) = 1 Variable Y = 1 (Carl) – 2 (Ernesto) = -1 One number +, the other -; Carl and Ernesto are discordant
Determining Pair Type Ike and Jasmine Variable X = 1 (Ike) – 1 (Jasmine) = 0 Variable Y = 1 (Ike) – 1 (Jasmine) = 0 Both numbers are 0; this constitutes a tied pair.
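The same subtraction rule can be applied to every pair in the table. The sketch below uses the nine cases from the "Values by Name and Variable" slide; the function name `pair_type` is an assumption made for this example. A pair counts as tied if the cases match on either variable, following the definition given earlier.

```python
from itertools import combinations

# Scores (X, Y) copied from the "Values by Name and Variable" table.
scores = {
    "Alex": (3, 3), "Carl": (3, 1), "Dawn": (2, 3),
    "Ernesto": (2, 2), "Fay": (2, 1), "Gus": (1, 3),
    "Hera": (1, 2), "Ike": (1, 1), "Jasmine": (1, 1),
}

def pair_type(a, b):
    """Classify a pair of cases by the signs of their X and Y differences."""
    dx = a[0] - b[0]
    dy = a[1] - b[1]
    if dx == 0 or dy == 0:
        return "tied"          # same value on one or both variables
    # Same sign on both differences -> concordant; opposite signs -> discordant.
    return "concordant" if dx * dy > 0 else "discordant"

counts = {"concordant": 0, "discordant": 0, "tied": 0}
for first, second in combinations(scores, 2):
    counts[pair_type(scores[first], scores[second])] += 1

print(pair_type(scores["Alex"], scores["Ike"]))
print(pair_type(scores["Carl"], scores["Ernesto"]))
print(pair_type(scores["Ike"], scores["Jasmine"]))
print(counts)
```

The three worked pairs come out concordant, discordant and tied, as on the slides. Across all 36 possible pairs, this table contains 10 concordant, 7 discordant and 19 tied pairs.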
Ordinal Coefficients There are four commonly used coefficients of association for ordinal data:
Kendall's tau-b
Kendall's tau-c
Somers' d
Goodman and Kruskal's gamma
Ordinal Coefficients Each is calculated somewhat differently (see JRM p. 442), but the intuition is that they measure the probability of concordant pairs minus the probability of discordant pairs.
Measure = p(concordance) - p(discordance), where p = probability.
The measures differ mainly in how they treat tied pairs.
Ordinal Coefficient Properties 1. Theoretically all vary between -1 and 1. 2. In practice a -1 or 1 is unlikely. In fact, a measure with an absolute value of .4 or greater indicates an association strong enough to investigate further. 3. Since 0 means no correlation, values between -.1 and .1 suggest a weak relationship. 4. All ordinal measures of correlation will have the same sign in a given table.
Ordinal Coefficient Properties 5. The absolute value of gamma (γ) will always be greater than or equal to the absolute value of any of the other measures. 6. The relationships among tau-b, tau-c and Somers' d are harder to generalize because they are affected differently by the structure of the table (that is, the number of rows and columns). 7. Somers' d is an asymmetric measure because its value depends on which variable is considered dependent.
Ordinal Coefficient Properties 8. A single measure by itself cannot assess how strongly one variable is related to another. After the statistical software calculates the measures, you should scrutinize the tables. Do not be lazy with analysis and interpretation. 9. These coefficients measure a particular type of association, namely correlation, whether linear or monotonic.
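To make properties 4, 5 and 7 concrete, here is a minimal sketch that computes gamma, tau-b and both versions of Somers' d for the nine-case table used earlier. The formulas are the standard pair-count definitions, not necessarily the exact presentation in JRM, and tau-c is left out for brevity. The names `pair_counts`, `tx` and `ty` are assumptions made for this example.

```python
from itertools import combinations
from math import sqrt

# (X, Y) scores from the nine-case table used earlier in the slides.
data = [(3, 3), (3, 1), (2, 3), (2, 2), (2, 1), (1, 3), (1, 2), (1, 1), (1, 1)]

def pair_counts(data):
    """Count concordant (c), discordant (d), X-only ties (tx) and Y-only ties (ty)."""
    c = d = tx = ty = 0
    for (x1, y1), (x2, y2) in combinations(data, 2):
        dx, dy = x1 - x2, y1 - y2
        if dx * dy > 0:
            c += 1
        elif dx * dy < 0:
            d += 1
        elif dx == 0 and dy != 0:
            tx += 1
        elif dy == 0 and dx != 0:
            ty += 1
        # Pairs tied on both variables enter none of the formulas below.
    return c, d, tx, ty

c, d, tx, ty = pair_counts(data)

gamma = (c - d) / (c + d)                            # ignores every tied pair
tau_b = (c - d) / sqrt((c + d + tx) * (c + d + ty))  # counts ties on either variable
somers_d_yx = (c - d) / (c + d + ty)                 # Y dependent: counts ties on Y only
somers_d_xy = (c - d) / (c + d + tx)                 # X dependent: counts ties on X only

print(round(gamma, 3), round(tau_b, 3), round(somers_d_yx, 3), round(somers_d_xy, 3))
```

Because gamma drops all ties from its denominator, its absolute value can never be smaller than that of tau-b or Somers' d. In this particular table the two versions of Somers' d happen to coincide because there are as many X-only ties as Y-only ties; in general they differ.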
Do I use gamma or tau? <ul><li>Compare the two measures to see if gamma overestimated: </li></ul><ul><li>If |gamma| – |tau| is > .05, use tau because gamma is overestimated </li></ul><ul><li>If |gamma| – |tau| is ≤ .05, use gamma because gamma did not overestimate </li></ul>
Which tau measure do I use? <ul><li>1. Tau-b is used for SQUARE tables where there are an equal number of categories for both variables (i.e. where there are an equal number of columns and rows). </li></ul><ul><li>2. Tau-c is used for all other tables. </li></ul>
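The two rules of thumb above can be written as small helper functions. This is a sketch of the slides' decision rules only; the function names and the `threshold` parameter are assumptions introduced for illustration.

```python
def choose_tau(n_rows, n_cols):
    """Tau-b for square tables, tau-c for every other shape."""
    return "tau-b" if n_rows == n_cols else "tau-c"

def choose_gamma_or_tau(gamma, tau, threshold=0.05):
    """Prefer tau when gamma exceeds it in absolute value by more than the threshold."""
    if abs(gamma) - abs(tau) > threshold:
        return "tau"    # gamma is treated as an overestimate
    return "gamma"      # gamma did not overestimate

print(choose_tau(3, 3))                   # square table
print(choose_tau(7, 2))                   # seven-point party ID by gender
print(choose_gamma_or_tau(0.30, 0.20))
print(choose_gamma_or_tau(0.22, 0.20))
```

The comparison uses absolute values, so the rule works the same way for negative relationships.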
When would I use Somers' d? <ul><li>1. Somers' d is preferable if you want an asymmetric measure (one that distinguishes between the independent and dependent variables). </li></ul>
Nominal Data Coefficients <ul><li>Proportional reduction of error (PRE) and Goodman-Kruskal's Lambda work essentially the same way. </li></ul><ul><li>Suppose you are asked to predict someone's electoral or policy preferences without any other prior information. </li></ul><ul><li>After making those guesses, you are given one other variable that may help increase the accuracy of your predictions, and you want to be able to measure the error reduction. </li></ul>
Nominal Data Coefficients <ul><li>For example: You are asked to guess without any additional information how each individual in a national sample of 1500 voters voted in the last presidential election. All you know is that 55 percent of the sample voted for Obama and 45 percent voted for McCain. </li></ul><ul><li>To minimize your number of errors you guess that every person in the sample voted for Obama. This way you get 55 percent of the sample correct. Still, you have 675 errors. </li></ul><ul><li>\(E_1 = 675\) </li></ul>
Nominal Data Coefficients <ul><li>Next, you are given the party identification of each person in the sample. You accordingly guess that every Democrat voted for Obama and every Republican voted for McCain. </li></ul><ul><li>This was a good strategy, as 92 percent of Democrats voted for Obama and 88 percent of Republicans voted for McCain. You also know that the sample is divided evenly with 750 Democrats and 750 Republicans. </li></ul>
Nominal Data Coefficients <ul><li>The number of errors the second time (\(E_2\)): </li></ul><ul><li>\(E_2 = 750(1 - .92) + 750(1 - .88)\) </li></ul><ul><li>\(E_2 = 750(.08) + 750(.12)\) </li></ul><ul><li>\(E_2 = 60 + 90 = 150\) </li></ul><ul><li>So the PRE is \(\dfrac{E_1 - E_2}{E_1} = \dfrac{675 - 150}{675} = \dfrac{525}{675} \approx .778\) </li></ul><ul><li>There is a reduction in error of about 77.8 percent. </li></ul>
Nominal Data Coefficients <ul><li>The first error term should always be: </li></ul><ul><li>\(E_1 = N - {}\)(number of cases in the modal category of Y) </li></ul><ul><li>\(E_1 = 1500 - 825 = 675\) </li></ul><ul><li>The Goodman-Kruskal Lambda measure can be adapted to work essentially the same way as the PRE measure. </li></ul>
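The PRE arithmetic can be checked with a short sketch. The cell counts are derived from the percentages on the slides (92 percent of 750 Democrats is 690, 88 percent of 750 Republicans is 660). The overall vote totals are taken as stated on the slide (825 and 675) rather than summed from the party table, and the function names are assumptions made for this example.

```python
def errors_without_predictor(marginal_counts):
    """E1: guess the modal category of Y for everyone; every other case is an error."""
    return sum(marginal_counts.values()) - max(marginal_counts.values())

def errors_with_predictor(counts_by_group):
    """E2: within each category of X, guess that group's modal category of Y."""
    return sum(sum(group.values()) - max(group.values())
               for group in counts_by_group.values())

def pre(e1, e2):
    """Proportional reduction in error: (E1 - E2) / E1."""
    return (e1 - e2) / e1

# Overall vote totals as stated on the slide (55 and 45 percent of 1500).
vote_totals = {"Obama": 825, "McCain": 675}

# Counts implied by the slide's percentages within each party.
votes_by_party = {
    "Democrat": {"Obama": 690, "McCain": 60},
    "Republican": {"Obama": 90, "McCain": 660},
}

e1 = errors_without_predictor(vote_totals)
e2 = errors_with_predictor(votes_by_party)
print(e1, e2, round(pre(e1, e2), 3))
```

A PRE of 0 would mean party identification did not help at all, and a PRE of 1 would mean it removed every error.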
Odds Ratio <ul><li>If you have a cross-tabulation consisting of two dichotomous variables, one of the best methods of analyzing the data is the odds ratio. </li></ul><ul><li>So, we want a 2×2 table. </li></ul>
Odds Ratio Data from the 2004 NES. Question: Do you [favor/oppose] the death penalty for persons convicted of murder?

Opinion   Male   Female
Favor     430    395
Oppose    139    197
Odds Ratio
\[ \mathrm{OR} = \frac{\text{odds men favor death penalty}}{\text{odds women favor death penalty}} = \frac{430/139}{395/197} = \frac{(430)(197)}{(139)(395)} \approx 1.54 \]
The odds of men favoring the death penalty are about one and a half times the odds of women favoring the death penalty.
What if we flip the categories?
\[ \mathrm{OR} = \frac{\text{odds women favor death penalty}}{\text{odds men favor death penalty}} = \frac{395/197}{430/139} = \frac{(395)(139)}{(430)(197)} \approx .65 \]
The odds of a woman favoring the death penalty are only about two-thirds the odds of a man doing the same.
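Both calculations can be reproduced with a small function. This is a sketch using the 2004 NES counts from the table above; the function name `odds_ratio` and its argument names are assumptions made for this example.

```python
def odds_ratio(a_favor, a_oppose, b_favor, b_oppose):
    """Odds that group A favors, divided by the odds that group B favors."""
    return (a_favor / a_oppose) / (b_favor / b_oppose)

# Counts from the death penalty table: men 430 favor / 139 oppose,
# women 395 favor / 197 oppose.
men_vs_women = odds_ratio(430, 139, 395, 197)
women_vs_men = odds_ratio(395, 197, 430, 139)

print(round(men_vs_women, 2))
print(round(women_vs_men, 2))
print(round(men_vs_women * women_vs_men, 6))
```

Flipping the groups inverts the ratio, so the two results multiply to 1. They describe the same association from opposite points of view.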
Odds Ratio <ul><li>The odds ratio compares the chances or likelihoods of something being chosen or happening. </li></ul><ul><li>In practice it is applied to discrete or categorical variables. </li></ul><ul><li>Unlike most measures, the odds ratio has a null value of 1.0, not 0. If the ratio is 1.0 there is no difference between the categories compared. </li></ul>
Odds Ratio <ul><li>The odds ratio is always positive. It can run from 0 to (plus) infinity. </li></ul><ul><li>The farther from 1.0 in either direction, the stronger the association. </li></ul><ul><li>You can invert the ratio if you wish to switch the group about which you make a statement. Ex: Men can be 4.0 times as likely as women to approve of the death penalty, or women can be .25 times as likely as men to favor the death penalty. The two statements are equivalent. </li></ul>
Odds Ratio <ul><li>The odds ratio is standardized. It is always expressed in units of odds. </li></ul><ul><li>The odds ratio can be applied to investigate patterns of association in tables larger than 2X2 and in multi-way cross-tabulations having more than three variables. </li></ul>