# Discriminant analysis: basic relationships


1. SW388R7 Data Analysis & Computers II: Discriminant Analysis - Basic Relationships
    - Discriminant Functions and Scores
    - Describing Relationships
    - Classification Accuracy
    - Sample Problems
2. Discriminant analysis
    - Discriminant analysis is used to analyze relationships between a non-metric dependent variable and metric or dichotomous independent variables.
    - Discriminant analysis attempts to use the independent variables to distinguish among the groups or categories of the dependent variable.
    - The usefulness of a discriminant model is based upon its accuracy rate, or ability to predict the known group memberships in the categories of the dependent variable.
3. Discriminant scores
    - Discriminant analysis works by creating a new variable called the discriminant function score, which is used to predict to which group a case belongs.
    - Discriminant function scores are computed similarly to factor scores, i.e. using eigenvalues. The computations find the coefficients for the independent variables that maximize the measure of distance between the groups defined by the dependent variable.
    - The discriminant function is similar to a regression equation in which the independent variables are multiplied by coefficients and summed to produce a score.
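The regression-like form of the discriminant function can be sketched in a few lines. This is only an illustration of "multiply by coefficients and sum": the function name, the coefficients, and the constant below are hypothetical and are not taken from any SPSS output in these slides.

```python
def discriminant_score(values, coefficients, constant=0.0):
    """Weighted sum of predictor values plus a constant, like a regression equation."""
    if len(values) != len(coefficients):
        raise ValueError("need one coefficient per independent variable")
    return constant + sum(c * x for c, x in zip(coefficients, values))


# Hypothetical coefficients for two predictors (illustration only).
example_score = discriminant_score(values=[2.0, 3.0], coefficients=[0.5, -1.0], constant=1.0)
print(example_score)
```

In a real analysis the coefficients would come from the fitted discriminant functions, not be chosen by hand.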
4. Discriminant functions
    - Conceptually, we can think of the discriminant function or equation as defining the boundary between groups.
    - Discriminant scores are standardized, so that if the score falls on one side of the boundary (standard score less than zero), the case is predicted to be a member of one group, and if the score falls on the other side of the boundary (positive standard score), it is predicted to be a member of the other group.
5. Number of functions
    - If the dependent variable defines two groups, one statistically significant discriminant function is required to distinguish the groups; if the dependent variable defines three groups, two statistically significant discriminant functions are required to distinguish among the three groups; etc.
    - If a discriminant function is able to distinguish among groups, it must have a strong relationship to at least one of the independent variables.
    - The number of possible discriminant functions in an analysis is limited to the smaller of the number of independent variables or one less than the number of groups defined by the dependent variable.
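The limit on the number of functions is simple arithmetic, shown here as a small sketch. The function name `max_discriminant_functions` is made up for this example.

```python
def max_discriminant_functions(n_independent_variables, n_groups):
    """Smaller of: number of independent variables, or number of groups minus one."""
    return min(n_independent_variables, n_groups - 1)


# Four predictors and a two-group dependent variable allow one function;
# the same predictors with three groups allow two.
print(max_discriminant_functions(4, 2))
print(max_discriminant_functions(4, 3))
```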
6. Overall test of relationship
    - The overall test of relationship among the independent variables and groups defined by the dependent variable is a series of tests that each of the functions needed to distinguish among the groups is statistically significant.
    - In some analyses, we might discover that two or more of the groups defined by the dependent variable cannot be distinguished using the available independent variables. While it is reasonable to interpret a solution in which there are fewer significant discriminant functions than the maximum number possible, our problems will require that all of the possible discriminant functions be significant.
7. Interpreting the relationship between independent and dependent variables
    - The interpretative statement about the relationship between the independent variable and the dependent variable is a statement like: cases in group A tended to have higher scores on variable X than cases in group B or group C.
    - This interpretation is complicated by the fact that the relationship is not direct, but operates through the discriminant function.
    - Dependent variable groups are distinguished by scores on discriminant functions, not on values of independent variables. The scores on functions are based on the values of the independent variables, which are multiplied by the function coefficients.
8. Groups, functions, and variables
    - To interpret the relationship between an independent variable and the dependent variable, we must first identify how the discriminant functions separate the groups, and then what the role of the independent variable is for each function.
    - SPSS provides a table called "Functions at Group Centroids" (multivariate means) that indicates which groups are separated by which functions.
    - SPSS provides another table called the "Structure Matrix" which, like its counterpart in factor analysis, identifies the loading, or correlation, between each independent variable and each function. This tells us which variables to interpret for each function. Each variable is interpreted on the function that it loads most highly on.
9. Functions at Group Centroids

    In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

    Functions at Group Centroids

    | WELFARE | Function 1 | Function 2 |
    |---|---|---|
    | 1 | -.220 | .235 |
    | 2 | .446 | -.031 |
    | 3 | -.311 | -.362 |

    Unstandardized canonical discriminant functions evaluated at group means

    - Function 1 separates survey respondents who thought we spend about the right amount of money on welfare (the positive value of 0.446) from survey respondents who thought we spend too much (negative value of -0.311) or too little money (negative value of -0.220) on welfare.
    - Function 2 separates survey respondents who thought we spend too little money on welfare (positive value of 0.235) from survey respondents who thought we spend too much money (negative value of -0.362) on welfare. We ignore the second group (-0.031) in this comparison because it was distinguished from the other two groups by function 1.
10. Structure Matrix
    - We do not interpret loadings in the structure matrix unless they are 0.30 or higher.
    - Based on the structure matrix, the predictor variables strongly associated with discriminant function 1, which distinguished between survey respondents who thought we spend about the right amount of money on welfare and survey respondents who thought we spend too much or too little money on welfare, were number of hours worked in the past week (r=-0.582) and highest year of school completed (r=0.687).
    - Based on the structure matrix, the predictor variable strongly associated with discriminant function 2, which distinguished between survey respondents who thought we spend too little money on welfare and survey respondents who thought we spend too much money on welfare, was self-employment (r=0.889).

    Structure Matrix

    | Variable | Function 1 | Function 2 |
    |---|---|---|
    | HIGHEST YEAR OF SCHOOL COMPLETED | .687* | .136 |
    | NUMBER OF HOURS WORKED LAST WEEK | -.582* | .345 |
    | R SELF-EMP OR WORKS FOR SOMEBODY | .223 | .889* |
    | RESPONDENTS INCOME (a) | .101 | .292* |

    Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.

    \*. Largest absolute correlation between each variable and any discriminant function.

    a. This variable not used in the analysis.
11. Group Statistics

    Group Statistics

    | WELFARE | Variable | Mean | Std. Deviation | Valid N (unweighted) | Valid N (weighted) |
    |---|---|---|---|---|---|
    | 1 TOO LITTLE | NUMBER OF HOURS WORKED LAST WEEK | 43.96 | 13.240 | 56 | 56.000 |
    | 1 TOO LITTLE | HIGHEST YEAR OF SCHOOL COMPLETED | 13.73 | 2.401 | 56 | 56.000 |
    | 1 TOO LITTLE | R SELF-EMP OR WORKS FOR SOMEBODY | 1.93 | .260 | 56 | 56.000 |
    | 1 TOO LITTLE | RESPONDENTS INCOME | 13.70 | 5.034 | 56 | 56.000 |
    | 2 ABOUT RIGHT | NUMBER OF HOURS WORKED LAST WEEK | 37.90 | 13.235 | 50 | 50.000 |
    | 2 ABOUT RIGHT | HIGHEST YEAR OF SCHOOL COMPLETED | 14.78 | 2.558 | 50 | 50.000 |
    | 2 ABOUT RIGHT | R SELF-EMP OR WORKS FOR SOMEBODY | 1.90 | .303 | 50 | 50.000 |
    | 2 ABOUT RIGHT | RESPONDENTS INCOME | 14.00 | 5.503 | 50 | 50.000 |
    | 3 TOO MUCH | NUMBER OF HOURS WORKED LAST WEEK | 42.03 | 10.456 | 32 | 32.000 |
    | 3 TOO MUCH | HIGHEST YEAR OF SCHOOL COMPLETED | 13.38 | 2.524 | 32 | 32.000 |
    | 3 TOO MUCH | R SELF-EMP OR WORKS FOR SOMEBODY | 1.75 | .440 | 32 | 32.000 |
    | 3 TOO MUCH | RESPONDENTS INCOME | 14.75 | 5.304 | 32 | 32.000 |
    | Total | NUMBER OF HOURS WORKED LAST WEEK | 41.32 | 12.846 | 138 | 138.000 |
    | Total | HIGHEST YEAR OF SCHOOL COMPLETED | 14.03 | 2.537 | 138 | 138.000 |
    | Total | R SELF-EMP OR WORKS FOR SOMEBODY | … | … | … | … |

    - The average number of hours worked in the past week for survey respondents who thought we spend about the right amount of money on welfare (mean=37.90) was lower than the average number of hours worked in the past week for survey respondents who thought we spend too little money on welfare (mean=43.96) and survey respondents who thought we spend too much money on welfare (mean=42.03).
    - This enables us to make the statement: "survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or too little money on welfare."
12. Which independent variables to interpret
    - In a simultaneous discriminant analysis, in which all independent variables are entered together, we only interpret the relationships for independent variables that have a loading of 0.30 or higher on one or more discriminant functions. A variable can have a high loading on more than one function, which complicates the interpretation. We will interpret the variable for the function on which it has the highest loading.
    - In a stepwise discriminant analysis, we limit the interpretation of relationships between independent variables and groups defined by the dependent variable to those independent variables that met the statistical test for inclusion in the analysis.
13. Discriminant analysis and classification
    - Discriminant analysis consists of two stages: in the first stage, the discriminant functions are derived; in the second stage, the discriminant functions are used to classify the cases.
    - While discriminant analysis does compute correlation measures to estimate the strength of the relationship, these correlations measure the relationship between the independent variables and the discriminant scores.
    - A more useful measure to assess the utility of a discriminant model is classification accuracy, which compares predicted group membership based on the discriminant model to the actual, known group membership, which is the value for the dependent variable.
14. Evaluating usefulness for discriminant models
    - The benchmark that we will use to characterize a discriminant model as useful is a 25% improvement over the rate of accuracy achievable by chance alone.
    - Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy.
    - The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by squaring the proportion of cases in each group and summing the squares.
15. Comparing accuracy rates
    - To characterize our model as useful, we compare the cross-validated accuracy rate produced by SPSS to 25% more than the proportional by chance accuracy.
    - The cross-validated accuracy rate is a one-at-a-time hold out method that classifies each case based on a discriminant solution for all of the other cases in the analysis. It is a more realistic estimate of the accuracy rate we should expect in the population because discriminant analysis inflates accuracy rates when the cases classified are the same cases used to derive the discriminant functions.
    - Cross-validated accuracy rates are not produced by SPSS when separate covariance matrices are used in the classification, which we address more next week.
16. Computing by chance accuracy

    The proportion of cases in each group defined by the dependent variable is reported in the table "Prior Probabilities for Groups".

    Prior Probabilities for Groups

    | WELFARE | Prior | Cases used in analysis (unweighted) | Cases used in analysis (weighted) |
    |---|---|---|---|
    | 1 TOO LITTLE | .406 | 56 | 56.000 |
    | 2 ABOUT RIGHT | .362 | 50 | 50.000 |
    | 3 TOO MUCH | .232 | 32 | 32.000 |
    | Total | 1.000 | 138 | 138.000 |

    The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.406² + 0.362² + 0.232² = 0.350). A 25% increase over this would require that our cross-validated accuracy be 43.7% (1.25 x 35.0% = 43.7%).
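The by chance arithmetic above can be reproduced with a short sketch. The priors are the ones reported in the slide; the function and variable names are hypothetical.

```python
def proportional_chance_accuracy(priors):
    """Sum of squared group proportions."""
    return sum(p ** 2 for p in priors)


priors = [0.406, 0.362, 0.232]          # from "Prior Probabilities for Groups"
chance_rate = proportional_chance_accuracy(priors)
criterion = 1.25 * chance_rate          # 25% improvement over chance

print(f"by chance accuracy: {chance_rate:.3f}")
print(f"required accuracy:  {criterion:.3f}")
```

Rounded to three decimals, these are the 0.350 and 0.437 quoted in the slide.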
17. Comparing the cross-validated accuracy rate

    Classification Results (b, c)

    | | | WELFARE | Predicted 1 TOO LITTLE | Predicted 2 ABOUT RIGHT | Predicted 3 TOO MUCH | Total |
    |---|---|---|---|---|---|---|
    | Original | Count | 1 TOO LITTLE | 43 | 15 | 6 | 64 |
    | | | 2 ABOUT RIGHT | 26 | 30 | 6 | 62 |
    | | | 3 TOO MUCH | 17 | 10 | 9 | 36 |
    | | | Ungrouped cases | 3 | 3 | 2 | 8 |
    | | % | 1 TOO LITTLE | 67.2 | 23.4 | 9.4 | 100.0 |
    | | | 2 ABOUT RIGHT | 41.9 | 48.4 | 9.7 | 100.0 |
    | | | 3 TOO MUCH | 47.2 | 27.8 | 25.0 | 100.0 |
    | | | Ungrouped cases | 37.5 | 37.5 | 25.0 | 100.0 |
    | Cross-validated (a) | Count | 1 TOO LITTLE | 43 | 15 | 6 | 64 |
    | | | 2 ABOUT RIGHT | 26 | 30 | 6 | 62 |
    | | | 3 TOO MUCH | 17 | 11 | 8 | 36 |
    | | % | 1 TOO LITTLE | 67.2 | 23.4 | 9.4 | 100.0 |
    | | | 2 ABOUT RIGHT | 41.9 | 48.4 | 9.7 | 100.0 |
    | | | 3 TOO MUCH | 47.2 | 30.6 | 22.2 | 100.0 |

    a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.

    b. 50.6% of original grouped cases correctly classified.

    c. 50.0% of cross-validated grouped cases correctly classified.

    SPSS reports the cross-validated accuracy rate in the footnotes to the table "Classification Results." The cross-validated accuracy rate computed by SPSS was 50.0%, which was greater than or equal to the proportional by chance accuracy criterion of 43.7%.
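The two footnote percentages can be recovered from the counts in the classification table: correct classifications sit on the diagonal. A minimal sketch, using the grouped-case counts from the slide (ungrouped cases excluded) and a hypothetical helper name:

```python
def accuracy_rate(confusion):
    """Share of cases on the diagonal of a square actual-by-predicted count table."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


original = [[43, 15, 6], [26, 30, 6], [17, 10, 9]]
cross_validated = [[43, 15, 6], [26, 30, 6], [17, 11, 8]]

print(f"original:        {accuracy_rate(original):.1%}")
print(f"cross-validated: {accuracy_rate(cross_validated):.1%}")

# Compare with the 43.7% criterion (1.25 x proportional by chance accuracy).
print(accuracy_rate(cross_validated) >= 0.437)
```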
18. Problem 1

    In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship.

    The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98] are useful in distinguishing between groups based on responses to "seen x-rated movie in last year" [xmovie]. These predictors differentiate survey respondents who had seen an x-rated movie in the last year from survey respondents who had not seen an x-rated movie in the last year.

    Survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year.

    1. True
    2. True with caution
    3. False
    4. Inappropriate application of a statistic
19. Dissecting problem 1 - 1
    - For these problems, we will assume that there is no problem with missing data, violation of assumptions, or outliers.
    - In this problem, we are told to use 0.05 as alpha for the discriminant analysis.
20. Dissecting problem 1 - 2
    - The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98].
    - The variable used to define groups is the dependent variable (DV): "seen x-rated movie in last year" [xmovie].
    - When a problem states that a list of independent variables can distinguish among groups, we do a discriminant analysis entering all of the variables simultaneously.
21. Dissecting problem 1 - 3
    - The problem identifies two groups for the dependent variable: survey respondents who had seen an x-rated movie in the last year, and survey respondents who had not seen an x-rated movie in the last year.
    - To distinguish between two groups, the analysis will be required to find one statistically significant discriminant function.
22. Dissecting problem 1 - 4
    - The specific relationships listed in the problem indicate how the independent variable relates to groups of the dependent variable, i.e., the mean for age will be lower for respondents who had seen an x-rated movie in the last year.
    - In order for the problem statement to be true, we must have enough statistically significant functions to distinguish among the groups, the classification accuracy rate must be substantially better than could be obtained by chance alone, and each significant relationship must be interpreted correctly.
23. LEVEL OF MEASUREMENT - 1
    - Discriminant analysis requires that the dependent variable be non-metric and the independent variables be metric or dichotomous.
    - "Seen x-rated movie in last year" [xmovie] is a dichotomous variable, which satisfies the level of measurement requirement. It contains two categories: survey respondents who had seen an x-rated movie in the last year and survey respondents who had not seen an x-rated movie in the last year.
24. LEVEL OF MEASUREMENT - 2
    - "Age" [age] and "highest year of school completed" [educ] are interval level variables, which satisfy the level of measurement requirements for discriminant analysis.
    - "Income" [rincom98] is an ordinal level variable. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for discriminant analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.
    - "Sex" [sex] is a dichotomous or dummy-coded nominal variable, which may be included in discriminant analysis.
25. Request simultaneous discriminant analysis
    - Select the Classify | Discriminant… command from the Analyze menu.
26. Selecting the dependent variable
    - First, highlight the dependent variable xmovie in the list of variables.
    - Second, click on the right arrow button to move the dependent variable to the Grouping Variable text box.
27. Defining the group values
    - When SPSS moves the dependent variable to the Grouping Variable text box, it puts two question marks in parentheses after the variable name. This is a reminder that we have to enter the numbers that represent the groups we want to include in the analysis.
    - First, to specify the group numbers, click on the Define Range… button.
28. Completing the range of group values
    - The value labels for xmovie show two categories: 1 = YES, 2 = NO. The range of values that we need to enter goes from 1 as the minimum to 2 as the maximum.
    - First, type in 1 in the Minimum text box.
    - Second, type in 2 in the Maximum text box.
    - Third, click on the Continue button to close the dialog box.
29. Selecting the independent variables
    - Move the independent variables listed in the problem to the Independents list box.
30. Specifying the method for including variables
    - SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included.
    - Since the problem states that there is a relationship without requesting the best predictors, we accept the default to Enter independents together.
31. Requesting statistics for the output
    - Click on the Statistics… button to select statistics we will need for the analysis.
32. Specifying statistical output
    - First, mark the Means checkbox on the Descriptives panel. We will use the group means in our interpretation.
    - Second, mark the Univariate ANOVAs checkbox on the Descriptives panel. Perusing these tests suggests which variables might be useful discriminators.
    - Third, mark the Box's M checkbox. Box's M statistic evaluates conformity to the assumption of homogeneity of group variances.
    - Fourth, click on the Continue button to close the dialog box.
33. Specifying details for classification
    - Click on the Classify… button to specify details for the classification phase of the analysis.
34. Details for classification - 1
    - First, mark the option button to Compute from group sizes on the Prior Probabilities panel. This incorporates the size of the groups defined by the dependent variable into the classification of cases using the discriminant functions.
    - Second, mark the Casewise results checkbox on the Display panel to include classification details for each case in the output.
    - Third, mark the Summary table checkbox to include summary tables comparing actual and predicted classification.
35. Details for classification - 2
    - Fourth, mark the Leave-one-out classification checkbox to request that SPSS include a cross-validated classification in the output. This option produces a less biased estimate of classification accuracy by sequentially holding each case out of the calculations for the discriminant functions, and using the derived functions to classify the case held out.
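The hold-out loop behind leave-one-out classification can be sketched without SPSS. To keep the example self-contained, the classifier below is a deliberately simple stand-in (assign a case to the group whose mean on a single predictor is closest), not the discriminant functions SPSS derives; the data and all names are made up for illustration.

```python
def nearest_mean_label(training_cases, value):
    """Stand-in classifier: pick the group whose training mean is closest to value."""
    by_group = {}
    for x, label in training_cases:
        by_group.setdefault(label, []).append(x)
    means = {label: sum(xs) / len(xs) for label, xs in by_group.items()}
    return min(means, key=lambda label: abs(value - means[label]))


def leave_one_out_accuracy(cases):
    """Classify each case using a rule derived from all of the other cases."""
    correct = 0
    for i, (value, label) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]   # hold out case i
        if nearest_mean_label(training, value) == label:
            correct += 1
    return correct / len(cases)


# Toy, well-separated data: (predictor value, group label).
toy_cases = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (7.0, "B"), (8.0, "B"), (9.0, "B")]
print(leave_one_out_accuracy(toy_cases))
```

The structure is the point here: the rule used to classify a case never sees that case, which is why the resulting accuracy is less optimistic than classifying the same cases used to derive the rule.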
36. Details for classification - 3
    - Fifth, accept the default of the Within-groups option button on the Use Covariance Matrix panel. The covariance matrices are the measure of the dispersion in the groups defined by the dependent variable. If we fail the homogeneity of group variances test (Box's M), our option is to use Separate-groups covariance in classification.
    - Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a visual plot of the relationship between functions and groups defined by the dependent variable.
    - Seventh, click on the Continue button to close the dialog box.
37. Completing the discriminant analysis request
    - Click on the OK button to request the output for the discriminant analysis.
38. Sample size: ratio of cases to variables

    Analysis Case Processing Summary

    | Unweighted Cases | | N | Percent |
    |---|---|---|---|
    | Valid | | 119 | 44.1 |
    | Excluded | Missing or out-of-range group codes | 49 | 18.1 |
    | | At least one missing discriminating variable | 66 | 24.4 |
    | | Both missing or out-of-range group codes and at least one missing discriminating variable | 36 | 13.3 |
    | | Total | 151 | 55.9 |
    | Total | | 270 | 100.0 |

    The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 119 valid cases and 4 independent variables. The ratio of cases to independent variables is 29.75 to 1, which satisfies the minimum requirement. In addition, the ratio of 29.75 to 1 satisfies the preferred ratio of 20 to 1.
39. Sample size: minimum group size

    Prior Probabilities for Groups

    | XMOVIE | Prior | Cases used in analysis (unweighted) | Cases used in analysis (weighted) |
    |---|---|---|---|
    | 1 | .311 | 37 | 37.000 |
    | 2 | .689 | 82 | 82.000 |
    | Total | 1.000 | 119 | 119.000 |

    In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and the group should preferably contain 20 or more cases.

    The number of cases in the smallest group in this problem is 37, which is larger than the number of independent variables (4), satisfying the minimum requirement. In addition, the number of cases in the smallest group satisfies the preferred minimum of 20 cases.

    If the sample size does not satisfy the minimum requirements, discriminant analysis is not appropriate.
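Both sample size rules (ratio of cases to variables, and size of the smallest group) reduce to a few comparisons. The sketch below uses the thresholds stated in the slides; the function name and the dictionary keys are hypothetical.

```python
def sample_size_check(n_valid_cases, n_independent_variables, smallest_group_size):
    """Apply the minimum and preferred sample size rules from the slides."""
    ratio = n_valid_cases / n_independent_variables
    return {
        "ratio": ratio,
        "meets_minimum_ratio": ratio >= 5,        # 5 to 1 minimum
        "meets_preferred_ratio": ratio >= 20,     # 20 to 1 preferred
        "group_exceeds_variables": smallest_group_size > n_independent_variables,
        "group_meets_preferred": smallest_group_size >= 20,
    }


# Problem 1: 119 valid cases, 4 independent variables, smallest group of 37.
result = sample_size_check(119, 4, 37)
for name, value in result.items():
    print(name, value)
```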
40. NUMBER OF DISCRIMINANT FUNCTIONS - 1
    - The maximum possible number of discriminant functions is the smaller of one less than the number of groups defined by the dependent variable and the number of independent variables.
    - In this analysis there were 2 groups defined by seen x-rated movie in last year and 4 independent variables, so the maximum possible number of discriminant functions was 1.
41. NUMBER OF DISCRIMINANT FUNCTIONS - 2
    - In the table of Wilks' Lambda, which tested functions for statistical significance, the direct analysis identified 1 discriminant function that was statistically significant. The Wilks' lambda statistic for the test of function 1 (chi-square=24.159) had a probability of <0.001, which was less than or equal to the level of significance of 0.05.
    - The significance of the maximum possible number of discriminant functions supports the interpretation of a solution using 1 discriminant function.
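The reported probability can be checked by hand. The slide gives the chi-square value but not its degrees of freedom; the sketch below assumes df = (number of independent variables) x (number of groups - 1) = 4 x 1 = 4, the usual df for the test of the first function. For even df the chi-square upper-tail probability has a closed form, so only the standard library is needed. The function name is hypothetical.

```python
import math


def chi_square_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-square variable with an even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# df = 4 is an assumption (4 predictors x (2 groups - 1)); it is not stated in the slide.
p_value = chi_square_upper_tail_even_df(24.159, df=4)
print(p_value < 0.001, p_value <= 0.05)
```

Under that df assumption the probability is well below 0.001, which agrees with the "<0.001" reported by SPSS.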
42. Independent variables and group membership: relationship of functions to groups

    In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

    Each function divides the groups into two subgroups by assigning negative values to one subgroup and positive values to the other subgroup. Function 1 separates survey respondents who had seen an x-rated movie in the last year (-.714) from survey respondents who had not seen an x-rated movie in the last year (.322).

    Functions at Group Centroids

    | XMOVIE | Function 1 |
    |---|---|
    | 1 | -.714 |
    | 2 | .322 |

    Unstandardized canonical discriminant functions evaluated at group means
43. 43. SW388R7 Independent variables and group membership: predictor loadings on functions Data Analysis & Computers II Slide 43 Based on the structure matrix, the predictor variables strongly associated with discriminant function 1, which distinguished between survey respondents who had seen an x-rated movie in the last year and survey respondents who had not seen an x-rated movie in the last year, were age (r=0.467) and sex (r=0.770). We do not interpret loadings in the structure matrix unless they are 0.30 or higher.

    Structure Matrix

    | Variable | Function 1 |
    |---|---|
    | SEX | .770 |
    | AGE | .467 |
    | EDUC | .118 |
    | RINCOM98 | .044 |

    Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.
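The 0.30 rule on slide 43 amounts to a simple filter over the structure matrix. A small sketch, with the loadings typed in from the slide; the variable names `loadings` and `interpreted` are illustrative only.

```python
# Structure-matrix loadings on function 1, copied from the slide
loadings = {"SEX": 0.770, "AGE": 0.467, "EDUC": 0.118, "RINCOM98": 0.044}

THRESHOLD = 0.30  # loadings below this absolute value are not interpreted

# Keep only the predictors whose absolute loading reaches the threshold
interpreted = [name for name, r in loadings.items() if abs(r) >= THRESHOLD]
print(interpreted)  # prints ['SEX', 'AGE']
```

The absolute value is used because a strong negative loading would be interpreted in the same way as a strong positive one.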
44. 44. SW388R7 Independent variables and group membership: predictors associated with first function - 1 Data Analysis & Computers II Slide 44 The average age for survey respondents who had seen an x-rated movie in the last year (mean=37.24) was lower than the average age for survey respondents who had not seen an x-rated movie in the last year (mean=42.70). This supports the relationship that "survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year."

    Group Statistics

    | XMOVIE | Variable | Mean | Std. Deviation | Valid N (listwise), Unweighted | Valid N (listwise), Weighted |
    |---|---|---|---|---|---|
    | 1 | AGE | 37.24 | 10.838 | 37 | 37.000 |
    | 1 | EDUC | 13.86 | 2.720 | 37 | 37.000 |
    | 1 | SEX | 1.27 | .450 | 37 | 37.000 |
    | 1 | RINCOM98 | 13.76 | 5.209 | 37 | 37.000 |
    | 2 | AGE | 42.70 | 11.461 | 82 | 82.000 |
    | 2 | EDUC | 14.18 | 2.534 | 82 | 82.000 |
    | 2 | SEX | 1.65 | .481 | 82 | 82.000 |
    | 2 | RINCOM98 | 14.00 | 5.308 | 82 | 82.000 |
    | Total | AGE | 41.00 | 11.508 | 119 | 119.000 |
    | Total | EDUC | 14.08 | 2.586 | 119 | 119.000 |
    | Total | SEX | 1.53 | .501 | 119 | 119.000 |
    | Total | RINCOM98 | 13.92 | 5.256 | 119 | 119.000 |
45. 45. SW388R7 Independent variables and group membership: predictors associated with first function - 2 Data Analysis & Computers II Slide 45 Since sex is a dichotomous variable, the mean is not directly interpretable. Its interpretation must take into account the coding by which 1 corresponds to male and 2 corresponds to female. The lower mean for survey respondents who had seen an x-rated movie in the last year (mean=1.27), when compared to the mean for survey respondents who had not seen an x-rated movie in the last year (mean=1.65), implies that the group contained more survey respondents who were male and fewer survey respondents who were female. This supports the relationship that "survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year."

    Group Statistics

    | XMOVIE | Variable | Mean | Std. Deviation | Valid N (listwise), Unweighted | Valid N (listwise), Weighted |
    |---|---|---|---|---|---|
    | 1 | AGE | 37.24 | 10.838 | 37 | 37.000 |
    | 1 | EDUC | 13.86 | 2.720 | 37 | 37.000 |
    | 1 | SEX | 1.27 | .450 | 37 | 37.000 |
    | 1 | RINCOM98 | 13.76 | 5.209 | 37 | 37.000 |
    | 2 | AGE | 42.70 | 11.461 | 82 | 82.000 |
    | 2 | EDUC | 14.18 | 2.534 | 82 | 82.000 |
    | 2 | SEX | 1.65 | .481 | 82 | 82.000 |
    | 2 | RINCOM98 | 14.00 | 5.308 | 82 | 82.000 |
    | Total | AGE | 41.00 | 11.508 | 119 | 119.000 |
    | Total | EDUC | 14.08 | 2.586 | 119 | 119.000 |
    | Total | SEX | 1.53 | .501 | 119 | 119.000 |
    | Total | RINCOM98 | 13.92 | 5.256 | 119 | 119.000 |
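The mean of a variable coded 1 and 2 converts directly into group proportions, which makes the comparison on slide 45 concrete. If a share p of cases is coded 2, the mean is 1(1 - p) + 2p = 1 + p, so p = mean - 1. The sketch below applies this to the two reported means; the function name `share_coded_two` is hypothetical.

```python
def share_coded_two(mean_of_codes: float) -> float:
    """For a variable coded 1/2, return the proportion of cases coded 2."""
    return mean_of_codes - 1.0


# Means of SEX (1 = male, 2 = female) from the Group Statistics table
for label, mean_sex in [("seen x-rated movie", 1.27), ("not seen", 1.65)]:
    female = share_coded_two(mean_sex)
    male = 1.0 - female
    print(f"{label}: {male:.0%} male, {female:.0%} female")
```

Read this way, roughly 73% of the first group and 35% of the second group were male, which is the difference the slide describes in words.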
46. 46. SW388R7 CLASSIFICATION USING THE DISCRIMINANT MODEL: by chance accuracy rate Data Analysis & Computers II Slide 46 The independent variables could be characterized as useful predictors of membership in the groups defined by the dependent variable if the cross-validated classification accuracy rate was significantly higher than the accuracy attainable by chance alone. Operationally, the cross-validated classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate. The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.311² + 0.689² = 0.571).

    Prior Probabilities for Groups

    | XMOVIE | Prior | Cases Used in Analysis, Unweighted | Cases Used in Analysis, Weighted |
    |---|---|---|---|
    | 1 | .311 | 37 | 37.000 |
    | 2 | .689 | 82 | 82.000 |
    | Total | 1.000 | 119 | 119.000 |
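The proportional by chance accuracy rate is the sum of the squared prior probabilities. A minimal sketch using the priors printed in the table; the function name `proportional_by_chance` is illustrative.

```python
def proportional_by_chance(priors):
    """Sum of squared group proportions."""
    return sum(p * p for p in priors)


priors = [0.311, 0.689]  # from the Prior Probabilities for Groups table
rate = proportional_by_chance(priors)
print(round(rate, 3))  # prints 0.571
```

The same function works for any number of groups, since each group simply contributes its squared proportion.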
47. 47. SW388R7 CLASSIFICATION USING THE DISCRIMINANT MODEL: criteria for classification accuracy Data Analysis & Computers II Slide 47 The cross-validated accuracy rate computed by SPSS was 71.4%, which was greater than or equal to the proportional by chance accuracy criterion of 71.4% (1.25 x 57.1% = 71.4%). The criterion for classification accuracy is satisfied.

    Classification Results (b, c)

    | | | XMOVIE | Predicted Group 1 | Predicted Group 2 | Total |
    |---|---|---|---|---|---|
    | Original | Count | 1 | 15 | 22 | 37 |
    | Original | Count | 2 | 12 | 70 | 82 |
    | Original | Count | Ungrouped cases | 13 | 36 | 49 |
    | Original | % | 1 | 40.5 | 59.5 | 100.0 |
    | Original | % | 2 | 14.6 | 85.4 | 100.0 |
    | Original | % | Ungrouped cases | 26.5 | 73.5 | 100.0 |
    | Cross-validated (a) | Count | 1 | 15 | 22 | 37 |
    | Cross-validated (a) | Count | 2 | 12 | 70 | 82 |
    | Cross-validated (a) | % | 1 | 40.5 | 59.5 | 100.0 |
    | Cross-validated (a) | % | 2 | 14.6 | 85.4 | 100.0 |

    a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
    b. 71.4% of original grouped cases correctly classified.
    c. 71.4% of cross-validated grouped cases correctly classified.
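The accuracy check on slide 47 can be reproduced from the cross-validated counts. This sketch compares percentages rounded to one decimal place, as the slide does; the two figures are nearly identical before rounding, so the outcome depends on that rounding. The variable names are illustrative.

```python
# Cross-validated counts: actual group -> predicted group -> number of cases
confusion = {
    1: {1: 15, 2: 22},
    2: {1: 12, 2: 70},
}

# Correct classifications lie on the diagonal (actual == predicted)
correct = sum(confusion[g][g] for g in confusion)
total = sum(sum(row.values()) for row in confusion.values())
accuracy_pct = round(100 * correct / total, 1)

chance_pct = 57.1                             # proportional by chance rate
criterion_pct = round(1.25 * chance_pct, 1)   # 25% improvement over chance

meets_criterion = accuracy_pct >= criterion_pct
print(correct, total, accuracy_pct, criterion_pct, meets_criterion)
```

Because the model only just reaches the criterion, a single additional misclassified case would change the conclusion.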
48. 48. SW388R7 Answering the question in problem 1 - 1 Data Analysis & Computers II Slide 48 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship. The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98] are useful in distinguishing between groups based on responses to "seen x-rated movie in last year" [xmovie]. These predictors differentiate survey respondents who had seen an x-rated movie in the last year from survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic We found one statistically significant discriminant function, making it possible to distinguish between the two groups defined by the dependent variable. Moreover, the cross-validated classification accuracy satisfied the by chance accuracy criterion, supporting the utility of the model.
49. 49. SW388R7 Answering the question in problem 1 - 2 Data Analysis & Computers II Slide 49 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship. The variables "age" [age], "highest year of school completed" [educ], "sex" [sex], and "income" [rincom98] are useful in distinguishing between groups based on responses to "seen x-rated movie in last year" [xmovie]. These predictors differentiate survey respondents who had seen an x-rated movie in the last year from survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were younger than survey respondents who had not seen an x-rated movie in the last year. Survey respondents who had seen an x-rated movie in the last year were more likely to be male than survey respondents who had not seen an x-rated movie in the last year. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic We verified that each statement about the relationship between predictors and groups was correct. The answer to the question is true with caution. A caution is added because of the inclusion of ordinal level variables.
50. 50. SW388R7 Problem 2 Data Analysis & Computers II Slide 50 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship. From the list of variables "respondents degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer. Survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic
51. 51. SW388R7 Dissecting problem 2 - 1 Data Analysis & Computers II Slide 51 The variables listed first in the problem statement are the independent variables (IVs): "respondents degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend]. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship. From the list of variables "respondents degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer. The variable used to define groups is the dependent variable (DV): "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect]. When a problem asks us to identify the best or most useful predictors from a list of independent variables, we do stepwise discriminant analysis.
52. 52. SW388R7 Dissecting problem 2 - 2 Data Analysis & Computers II Slide 52 The problem identifies two groups for the dependent variable: • survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby • survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. To distinguish between two groups, the analysis will be required to find one statistically significant discriminant function. In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship. From the list of variables "respondents degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer. The importance of predictors is based upon the stepwise addition of variables to the analysis.
53. 53. SW388R7 Dissecting problem 2 - 3 Data Analysis & Computers II Slide 53 From the list of variables "respondents degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer. Survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic The specific relationships listed in the problem indicate how the independent variable relates to groups of the dependent variable, i.e., the mean for frequency of prayer will be lower for respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby compared to survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. In a stepwise analysis, we only interpret the independent variables that are entered in the stepwise analysis. In order for a stepwise analysis to be true, we must have enough statistically significant functions to distinguish among the groups, the order of entry must be correct, and each significant relationship must be interpreted correctly.
54. 54. SW388R7 LEVEL OF MEASUREMENT - 1 Data Analysis & Computers II Slide 54 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship. From the list of variables "respondents degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer. Survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. Discriminant analysis requires that the dependent variable be non-metric and the independent variables be metric or dichotomous. "Attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is a nominal level variable, which satisfies the level of measurement requirement.
55. 55. SW388R7 LEVEL OF MEASUREMENT - 2 Data Analysis & Computers II Slide 55 In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data, violation of assumptions, or outliers. Use a level of significance of 0.05 for evaluating the statistical relationship. From the list of variables "respondents degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend], the most useful predictor for distinguishing between groups based on responses to "attitude toward abortion when there is a strong chance of serious defect in the baby" [abdefect] is "frequency of prayer" [pray]. These predictors differentiate survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby from survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. The most important predictor of groups based on responses to attitude toward abortion when there is a strong chance of serious defect in the baby was frequency of prayer. Survey respondents who didn't think it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby prayed more often than survey respondents who thought it should be possible for a woman to obtain a legal abortion if there is a strong chance of a serious defect in the baby. "Respondents degree of religious fundamentalism" [fund], "frequency of prayer" [pray], and "frequency of attendance at religious services" [attend] are ordinal level variables. If we follow the convention of treating ordinal level variables as metric variables, the level of measurement requirement for discriminant analysis is satisfied. Since some data analysts do not agree with this convention, a note of caution should be included in our interpretation.
56. 56. SW388R7 Request stepwise discriminant analysis Data Analysis & Computers II Slide 56 Select the Classify | Discriminant… command from the Analyze menu.
57. 57. SW388R7 Selecting the dependent variable Data Analysis & Computers II Slide 57 First, highlight the dependent variable abdefect in the list of variables. Second, click on the right arrow button to move the dependent variable to the Grouping Variable text box.
58. 58. SW388R7 Defining the group values Data Analysis & Computers II Slide 58 When SPSS moves the dependent variable to the Grouping Variable text box, it puts two question marks in parentheses after the variable name. This is a reminder that we have to enter the numbers that represent the groups we want to include in the analysis. To specify the group numbers, click on the Define Range… button.
59. 59. SW388R7 Completing the range of group values Data Analysis & Computers II Slide 59 The value labels for abdefect show two categories: 1 = YES, 2 = NO. The range of values that we need to enter goes from 1 as the minimum to 2 as the maximum. First, type in 1 in the Minimum text box. Second, type in 2 in the Maximum text box. Third, click on the Continue button to close the dialog box.
60. 60. SW388R7 Selecting the independent variables Data Analysis & Computers II Slide 60 Move the independent variables listed in the problem to the Independents list box.
61. 61. SW388R7 Specifying the method for including variables Data Analysis & Computers II Slide 61 SPSS provides us with two methods for including variables: entering all of the independent variables at one time, and a stepwise method that uses a statistical test to determine the order in which variables are included. Since the problem calls for identifying the best predictors, we click on the option button to Use stepwise method.