Bivariate analysis

BIVARIATE ANALYSIS: ESTIMATING ASSOCIATIONS
Carol S. Aneshensel
University of California, Los Angeles
January 16, 2003

I. OVERVIEW
   A. Determines whether two variables are empirically associated.
   B. Key issue: Is the association in the form anticipated by theory?
   C. Lays the foundation for multivariate analysis
II. METHODS OF BIVARIATE ANALYSIS
   A. Criteria for Selection of a Method
   B. Proportions: Contingency Tables
      1. Test of conditional probabilities
      2. Used with 2 categorical variables
      3. No distributional assumptions
      4. Low statistical power
      5. Calculation of chi-squared (χ²) and its degrees of freedom
   C. Mean Differences: Analysis of Variance
      1. Test of mean differences relative to variance
      2. Independent variable is categorical
      3. Dependent variable is interval or ratio
      4. No assumptions about the form of the association
      5. Calculation of F and its degrees of freedom
   D. Correlation: Correlation Coefficients
      1. Test of linear association
      2. Two interval variables
      3. Linear association
      4. Calculation of t and its degrees of freedom
The first step in the analysis of a focal relationship is to determine whether there is an empirical association between its two component variables. This objective is accomplished by means of bivariate analysis. This analysis ascertains whether the values of the dependent variable tend to coincide with those of the independent variable. In most instances, the association between two variables is assessed with a bivariate statistical technique (see below for exceptions). The three most commonly used techniques are contingency tables, analysis of variance (ANOVA), and correlations. The basic bivariate analysis is then usually extended to a multivariate form to evaluate whether the association can be interpreted as a relationship.

Not just any association will do, however: we are interested in one particular association, that predicted by theory. If we expect to find a linear association, but find instead a U-shaped one, then our theory is not supported even though the two variables are associated with one another. Thus, the object of explanatory analysis is to ascertain whether the independent variable is associated with the dependent variable in the manner predicted by theory.

In some instances, the association between two variables is assessed with a multivariate rather than a bivariate statistical technique. This situation arises when two or more variables are needed to express the functional form of the association. For example, the correlation coefficient estimates the linear association between two variables, but a nonlinear association requires a different approach, such as the parabola specified by the terms X and X². Although two analytic variables (X and X²) are used to operationalize the form of the association (parabola), these variables pertain to one substantive theoretical variable (X).
The two analytic variables are best thought of as one 2-part variable that reflects the nonlinear form of the association with the dependent variable. Thus, the analysis is bivariate even though a multivariate statistical technique is used.

Although this distinction may appear to be hair-splitting, it reduces confusion about the focal relationship when more than one term is used to operationalize the independent variable. For example, the categories of ethnicity might be converted into a set of dichotomous "dummy variables" indicating whether the person is (1) African American, (2) Latino, (3) Asian American, or (4) non-Latino White (or is in the excluded reference category of "Other"). A theoretical model containing one independent variable, ethnicity, now appears to involve four independent variables. The four "dummy variables," however, are in actuality one composite variable with five categories. This type of hybrid variable requires a multivariate statistical technique, such as regression, even though it represents a bivariate association.

The importance of bivariate analysis is sometimes overlooked because it is superseded by multivariate analysis. This misperception is reinforced by scientific journals that report bivariate associations only in passing, if at all. This practice creates the misleading impression that analysis begins at the multivariate level. In reality, the multiple-variable model rests upon the foundation laid by the thorough analysis of the 2-variable model. The proper specification of the theoretical model at the bivariate level is essential to the quality of subsequent multivariate analysis.

Some forms of bivariate analysis require that variables be differentiated into independent or dependent types. For example, the analysis of group differences in means, either by t-test or ANOVA, treats the group variable as independent, which means that the procedure is asymmetrical: different values are obtained if the independent and dependent variables are inverted. In contrast, the Pearson correlation coefficient, the most widely used measure of bivariate association, yields identical values irrespective of which variable is treated as dependent, meaning that it is symmetrical: the same coefficient and probability level are obtained if the two variables are interchanged. Similarly, the chi-squared (χ²) test for independence between nominal variables yields the same value irrespective of whether the dependent variable appears in the rows or the columns of the contingency table. Although the test of statistical significance is unchanged, switching variables yields different expressions of the association because row and column percentages are not interchangeable. Unlike the correlation coefficient, where both the statistic and test of statistical significance are symmetrical, only the probability level is symmetrical in the χ² technique.

Designating one variable as independent and the other variable as dependent is productive even when this differentiation is not required by the statistical method. The value of this designation lies in setting the stage for subsequent multivariate analysis where this differentiation is required by most statistical techniques.¹ This designation is helpful in the bivariate analysis of the focal relationship because multivariate analysis ultimately seeks to determine whether the bivariate association is indicative of a state of dependency between the two variables.
This approach makes more sense if the original association is conceptualized as a potential relationship.

METHODS OF BIVARIATE ANALYSIS

Selection of a Method

There is a multitude of statistical techniques for the assessment of bivariate associations. This profusion of techniques reflects a key consideration in the selection of a method of analysis: the measurement properties of the independent and dependent variables. For example, correlational techniques are the method of choice for analysis of two interval variables (when the association is assumed to be linear), but are not suitable to the analysis of two categorical variables. Given that there are numerous possible combinations of measurement types, there are numerous analytic techniques.

A second contributor to this proliferation is sample size: some methods are applicable only to large samples. Statistical techniques are also distinguished from one another on the basis of assumptions about the distributional properties of the variables. For example, there are different computational formulas for the simple t-test depending upon whether the variance of the dependent variable is assumed to be the same in the two groups being compared. In contrast, nonparametric techniques make no distributional assumptions.

¹ Some multivariate techniques do not require that one variable be treated as dependent, for example, log-linear models, but this situation is an exception.

The sheer number of alternative methods can bewilder. The bulk of bivariate analysis in the social sciences, however, is conducted with three techniques: contingency table analysis of proportions, ANOVA assessment of mean differences between groups, and correlation coefficients. As illustrated in Figure 1, a key consideration in the selection of a technique is the level of measurement. The contingency table technique is used when both variables are nominal. Means are analyzed when the independent variable is nominal and the dependent variable is interval or ratio. Correlations are calculated when both variables are interval or ratio (and the association is assumed to be linear).

These three methods do not exhaust the possible combinations of independent and dependent variables, as indicated by the blank cells in Figure 1. Although there are alternative methods of analysis for these combinations, many researchers adapt one of the three methods shown in this figure. For instance, if the dependent variable is nominal and the independent variable is measured at a higher level, the independent variable is often collapsed into categorical form. This transformation permits the use of the familiar χ² test, but wastes valuable information about the inherent ordering of the interval variable.

Ordinal variables are a bit troublesome because they do not satisfy the assumptions of the interval methods of analysis, but the use of a nominal method of analysis, the practical alternative, is wasteful because it does not make use of the ordering information. Interval methods are often used for ordinal variables that approximate interval variables, that is, quasi-interval variables, but strictly speaking this practice is inappropriate.
If the ordinal variable does not approximate an interval variable, then it can be treated as a nominal variable in a χ² test, although, once again, this practice wastes information.

Why not use a statistical technique intended specifically for the exact measurement properties of the independent and dependent variables instead of relaxing assumptions or losing power? Surely these techniques are more appropriate than the adaptations just described. What accounts for the popularity of bending methods or measures to permit the use of conventional methods of bivariate analysis? Quite simply, the conventional methods readily generalize into familiar methods of multivariate analysis. Correlations and ANOVA form the foundation for multiple linear regression. Similarly, logistic regression is based on the techniques used with contingency tables. These three methods of bivariate analysis are used frequently, then, because of their conceptual continuity with common forms of multivariate analysis.

The reader is referred to a standard statistics text for less frequently used methods of assessing bivariate associations. These methods are omitted here so that excessive attention to technique does not deflect attention from the logic of analysis. A discussion of the relative merits of various types of correlation coefficients, for example, would unnecessarily divert attention from the question of whether a linear model, assumed in correlational techniques, is appropriate based on theory: not theory as taught in a statistics class, but the substantive theory directing the research. According to the gospel of statistical theory, my dismissive treatment of technically correct methods of bivariate analysis is heretical. In defense of this stance, I note only that I am merely calling attention to a widespread practice in applied analysis.
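The selection rule described above for the three conventional techniques can be written down as a small lookup. The sketch below is only an illustration of that rule, not a reproduction of Figure 1: the function name `choose_method` and the level labels are hypothetical, and ratio variables are assumed to be handled like interval variables.

```python
# Illustrative lookup of the three conventional bivariate techniques,
# keyed by (independent level, dependent level). Names are hypothetical.
CONVENTIONAL = {
    ("nominal", "nominal"): "contingency table (chi-squared test)",
    ("nominal", "interval"): "analysis of variance (mean differences)",
    ("interval", "interval"): "correlation coefficient (linear association)",
}


def choose_method(independent, dependent):
    """Return the conventional technique, or None for a blank cell."""

    def normalize(level):
        # Assumption: ratio variables are treated like interval variables.
        return "interval" if level == "ratio" else level

    return CONVENTIONAL.get((normalize(independent), normalize(dependent)))


print(choose_method("nominal", "ratio"))     # ANOVA case
print(choose_method("interval", "nominal"))  # blank cell: None
```

A result of `None` corresponds to the combinations for which, as noted above, researchers often collapse or otherwise adapt a variable so that one of the three methods applies.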
In addition to level of measurement, the selection of a statistical procedure should also be based upon the type of association one expects to find. In most applications, this issue focuses on whether it is appropriate to use a linear model. In this context, linear means that there is a constant rate of change in the dependent variable across all values of the independent variable. This concept makes sense only where there are constant intervals on both variables, which means that, strictly speaking, linearity is relevant only when both variables are measured at the interval level.² In other applications, the issue is specifying which groups are expected to differ from others. An example of this approach would be hypothesizing that the prevalence of depression is greater among women than men, as distinct from asserting that gender and depression are associated with one another (see below).

In practice, the expected functional form of an association is often overlooked in the selection of an analytic technique. We tend to become preoccupied with finding a procedure that fits the measurement characteristics of the independent and dependent variables. The validity of the entire analysis, however, depends upon the selection of analytic techniques that match theory-based expectations about the form of the association. Unfortunately, theory is often mute on this topic. Nevertheless, it is incumbent upon the analyst to translate theory into the appropriate analytic model.

Methods: Proportions, Means and Correlations

In this section, the three most common methods of bivariate analysis are summarized briefly: χ² tests of proportions, ANOVA for mean differences, and correlation coefficients for linear associations. As noted above, the selection of a method is likely to be driven by the measurement characteristics of the independent and dependent variables.
Contingency tables are appropriate for two nominal variables; tests of mean differences are used for an interval outcome and nominal independent variable; correlations are employed for linear associations between interval variables (see Figure 1). If both variables are interval, but the association is expected to be non-linear, then the correlational technique needs to be adapted to the expected form of the association using a multiple regression format.

Each of these forms of analysis can be performed using any major statistical software program. The emphasis of this presentation, therefore, is not on computations, but on interpretation. It is useful, however, to review the fundamentals of these methods of analysis to understand their proper use and interpretation.

² There are a few common exceptions. For example, correlation coefficients are often calculated for ordinal variables that are quasi-interval. Also, dichotomous dependent variables are often treated as interval because there is only one interval.
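The point about adapting the correlational technique to a non-linear form can be illustrated with a small sketch. It assumes a hypothetical, perfectly U-shaped data set and fits the parabola with the two analytic terms X and X² by ordinary least squares; the helper names (`pearson_r`, `solve`, `quadratic_fit`) are illustrative and come from no particular package.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def quadratic_fit(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    cols = [[1.0] * len(x), [float(v) for v in x], [float(v * v) for v in x]]
    a = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    b = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve(a, b)


# Hypothetical U-shaped association: Y depends on X only through X squared.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [2 + 0.5 * v * v for v in x]

r = pearson_r(x, y)
b0, b1, b2 = quadratic_fit(x, y)
print(f"linear r = {r:.3f}; parabola: {b0:.2f} + {b1:.2f}*X + {b2:.2f}*X^2")
```

In this constructed case the linear correlation is essentially zero even though Y is completely determined by X, while the two-term specification recovers the parabola. Real data would of course include error, and the form to fit should come from theory rather than from the data alone.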
It should be noted that the techniques discussed here are but a few of the many options available for bivariate analysis. Characteristics of one's data may make other approaches much more appropriate. The three techniques described here are highlighted because of their widespread use and because they form the basis for the most commonly used types of multivariate analysis. The anticipation of multivariate analysis makes it logical to conduct bivariate analysis that is consistent with the multivariate model.

Proportions: Contingency Tables. Two properties of the χ² analysis of a contingency table make it an especially appealing form of bivariate analysis. First, it is based on the lowest form of measurement, two nominal variables. The absence of level-of-measurement restrictions means that the technique may also be used with ordinal, interval or ratio data. Second, this technique does not require assumptions about the nature of the association; in particular, it does not assume a linear association. It is used to determine whether any association is present in the data, without specifying in advance the expected form of this association. This flexibility is the method's most appealing characteristic.

These characteristics, however, also establish the limitations of the method. Using χ² analysis with higher-order variables means that some data is transformed into a lower form of measurement, converted to categorical form. This transformation leads to a loss of information and a concomitant loss of statistical power. Although other statistics for contingency table analysis take into consideration the ordinal quality of variables (e.g., Somers' D), these techniques are not as widely used as the simple yet less powerful χ².

Furthermore, the χ² test only tells you that some association seems to be present, without regard to its theoretical relevance.
The conclusion that an association is present is not nearly as meaningful, compelling, or satisfying as the conclusion that the expected association is present. The χ² test does not yield this information, although it is possible to adapt the method to this end.

The χ² test for independence is used to determine whether there is an association between two categorical variables. If the two variables are unrelated, then the distribution of one variable should be the same regardless of the value of the other variable. If instead the two variables are associated, then the distribution of the dependent variable should differ across the values of the independent variable. The χ² test for independence does not distinguish between independent and dependent variables. Treating one variable as independent is optional and does not alter the value of the test statistic.

The dependency between the variables could be stated in the reverse direction: the distribution of the independent variable differs across categories of the dependent variable. Although immaterial to the calculation of χ², this formulation is backwards in terms of the logic of cause and effect. It treats the dependent variable as fixed and assesses variation in the independent variable across these fixed values. However, a variable that depends upon another does not have fixed values; its values vary according to the influence of the independent variable. For example, the association between gender and depression is best stated as differences in the probability of being depressed between men and women, not whether the probability of being a woman differs between depressed and not depressed persons. The proper formulation of the association, then, is to examine variation in the distribution of the outcome variable across categories of its presumed cause.³

The χ² test for independence is illustrated in Figure 2. In this contingency table, the independent variable X appears in the rows (1...i) and the dependent variable Y appears in the columns (1...j). The analytic question is whether the distribution of Y varies across the categories of X.

³ Whether this distribution is calculated as row or column percentages is immaterial.
The overall distribution of Y is given at the bottom of the table. For example, the proportion in column 1 is p.1 = N.1/N; the proportion in column 2 is p.2 = N.2/N; and so on until p.j = N.j/N. This proportional distribution should be duplicated within each row of the table if Y is indeed independent of X. In other words, the distribution of subjects within row 1 should resemble the distribution of subjects within row 2, and so on through row i, the last row. This similarity means that the proportion of subjects in column j should be similar across all categories of X (1, 2, ... i), and similar to the overall proportion of subjects in column j (p.j or N.j/N). This equivalency should be manifest for all values of Y across all values of X.

The null hypothesis for the χ² test for independence essentially states that identical conditional probabilities are expected under the condition that Y does not depend upon X:

\[
\begin{aligned}
H_0:\quad p_{11} &= p_{21} = \dots = p_{i1} = p_{\cdot 1} = N_{\cdot 1}/N \\
p_{12} &= p_{22} = \dots = p_{i2} = p_{\cdot 2} = N_{\cdot 2}/N \\
&\;\;\vdots \\
p_{1j} &= p_{2j} = \dots = p_{ij} = p_{\cdot j} = N_{\cdot j}/N
\end{aligned}
\tag{1}
\]

The null hypothesis is evaluated with the χ² test statistic. Its definitional formula and degrees of freedom appear in Figure 2. A large χ² value relative to its degrees of freedom leads to rejection of the null hypothesis. The null hypothesis is rejected if any p_ij ≠ N.j/N, that is, if any proportion within the table deviates substantially from the marginal distributions of X and Y. This result means that the observed covariation between X and Y is unlikely to occur if in reality X and Y are independent.

The key to understanding this procedure is the calculation of expected values for the cells comprising the contingency table. These values are calculated assuming that column distributions are independent of row distributions (and vice versa). If so, then the overall marginal proportions for the rows and columns should be replicated within the body of the table.
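The calculation of expected values can be sketched in a few lines. Figure 2 is not reproduced here, so the sketch assumes the standard definitional formula, the sum over cells of (observed − expected)² / expected with (rows − 1)(columns − 1) degrees of freedom; the 2 × 2 table of counts is hypothetical and the function name `chi_squared` is illustrative.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / N.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Hypothetical counts: rows are categories of X, columns are categories of Y.
observed = [[30, 70],
            [60, 40]]

statistic, df, expected = chi_squared(observed)
print(f"chi-squared = {statistic:.2f} on {df} degree(s) of freedom")

# Interchanging rows and columns leaves the statistic unchanged.
transposed = [list(col) for col in zip(*observed)]
statistic_t, df_t, _ = chi_squared(transposed)
```

Obtaining a probability level requires the χ² distribution, which most statistical packages supply. The transposed table gives the same statistic, in line with the earlier remark that the test does not distinguish independent from dependent variables.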
The value expected under independence is compared to the observed value for each cell. If the assumption of independence is valid, there should be only a small difference between expected and observed values. If instead there is a large difference for one or more cells, the χ² statistic will be large relative to its degrees of freedom and the null hypothesis will be rejected.

The chief limitation of this procedure is the nonspecific nature of the null hypothesis. When we reject it, we know that one or more cell frequencies differ markedly from the expected, but do not know which cells are deviant. We know that an association exists, but do not know the form of the association. It might be the one predicted by theory, but then again it might be some other form of association. This limitation can be remedied if the expected relationship is more precisely specified prior to analysis.

To continue the example given earlier in this section, we might hypothesize that females are more likely to be depressed than males. This hypothesis is considerably more precise than the hypothesis that gender and depression are associated with one another. It could be operationalized as an odds ratio greater than 1.00 for depression for females relative to males. The odds ratio expresses the association between variables in a multiplicative form, meaning that a value of 1.00 is equivalent to independence. The odds ratio for the data reported in Table 1 is 1.80. This is the exact value for the sample (N = 1,393). If we wish to extend this finding to the population, we could calculate a confidence interval for the odds ratio and determine whether it includes the value of 1.⁴ The 95% confidence interval for this example is 1.25 to 2.60. The lower boundary is greater than 1, meaning that the odds of being depressed are significantly higher for women than men. Note that this conclusion is more precise than the conclusion made with χ², namely that depression is not independent of gender.

Note also that we are interested only in the possibility that females are at greater risk. A greater risk among males would disconfirm our theory. However, a male excess of depression would lead to rejection of the null hypothesis in the χ² procedure. Our more specific null hypothesis is rejected only if the confidence interval lies entirely above 1; we fail to reject if the confidence interval includes 1 or lies entirely below 1. When only an association is hypothesized, we fail to reject only when the confidence interval includes 1, not when it is smaller than 1. By specifying the nature of the association we have increased the power of the test.

⁴ Agresti (1990:54-6) gives the following formula for the 100(1-α) percent confidence interval for log θ for large samples:

\[
\log\hat{\theta} \pm z_{\alpha/2}\,\hat{\sigma}(\log\hat{\theta}),
\qquad
\hat{\theta} = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}},
\qquad
\hat{\sigma}(\log\hat{\theta}) = \left[\frac{1}{n_{11}} + \frac{1}{n_{22}} + \frac{1}{n_{12}} + \frac{1}{n_{21}}\right]^{1/2}
\]

where θ̂ is the sample value of the odds ratio. The confidence interval for θ is obtained by exponentiating (taking the antilog of) the endpoints of this interval. (Agresti, Alan. 1990. Categorical Data Analysis. New York: John Wiley & Sons.)
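The large-sample interval in the Agresti footnote is simple to compute directly. The sketch below is a minimal illustration: the cell counts are hypothetical (they are not the Table 1 data, which are not reproduced in this section), the function name `odds_ratio_ci` is illustrative, and z = 1.96 is assumed for a 95% interval.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio with a large-sample confidence interval.

    The interval is built on the log scale and then exponentiated.
    """
    theta = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n22 + 1 / n12 + 1 / n21)
    lower = math.exp(math.log(theta) - z * se_log)
    upper = math.exp(math.log(theta) + z * se_log)
    return theta, lower, upper


# Hypothetical 2 x 2 table:
#              depressed   not depressed
#   women         40            160
#   men           20            180
theta, lower, upper = odds_ratio_ci(40, 160, 20, 180)
print(f"odds ratio = {theta:.2f}, 95% CI = {lower:.2f} to {upper:.2f}")

# The directional hypothesis is supported only if the whole interval exceeds 1.
supports_female_excess = lower > 1
```

With these made-up counts the interval lies entirely above 1, which is the pattern the directional hypothesis requires. An interval entirely below 1 would reject independence in the χ² sense while disconfirming the theory.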
childbearing age prior to the baby boom period.

These age trends are even more evident for having a married child. This event is relatively rare for the youngest group, but quite common among persons over the age of 50 years. Age is strongly associated with having children who have divorced, including those who are currently divorced, and with having children who have remarried.

No tests of statistical significance are presented for the data in Table 2 in the original report. The differences are quite large between the youngest cohort and the other cohorts. Given the large sample size, these differences are obviously of statistical significance.

If χ² tests were provided, however, there would be five such tests, one for each dependent variable. At first glance, it may not be obvious that there are five dependent variables. This confusion arises because the table reports percentages rather than cell frequencies (compare with Figure 2). (Cell frequencies can be extracted from Table 2 by multiplying the percentage by the sample size, and dividing by 100.) Only the percent "yes" is given in Table 2 because the percent "no" is implicit given that each of these variables is a dichotomy. Thus, Table 2 represents the cross-tabulation of age (in 4 categories) by each of the following dependent variables: adult child/no; ever-married child/no; ever-divorced child/no; currently separated-divorced child/no; remarried child/no.

The question of linearity is relevant to this example because the independent variable, age, is quasi-interval and the dependent variables are all dichotomies. Two distinct nonlinear patterns are evident in these data. The first is an inverted U-shaped curve: a sharp increase, especially between the forties and fifties, followed by a decline among those who are seventy or older. As noted previously, the researchers attribute the decline to a pre-baby boom cohort effect. The first three dependent variables follow this pattern.
The increase at earlier ages reflects the combined impact of life course considerations, such as the duration of risk for the event. These factors appear to be the primary consideration for the second pattern, an initial increase followed by a plateau, which describes the last two entries in the table.

Means: Analysis of Variance. The ANOVA procedure is similar in many respects to the χ² test for independence. In both techniques, the independent variable needs to be measured only at the nominal level. Also, the null hypothesis is structurally similar in the two procedures. In the case of χ², we test whether proportions are constant across categories of the independent variable and, therefore, equal to the overall marginal proportion. For ANOVA, we test whether means are constant across categories of the independent variable and, thus, equal to the grand mean. In both techniques, the null hypothesis is nonspecific: rejecting it is not informative about which categories of the independent variable differ.

The main practical difference between methods is that ANOVA requires an interval level of measurement for the dependent variable.⁶ This restriction means that the ANOVA technique is not as widely applicable as the χ² test, for which the dependent variable can be nominal. The limitation of ANOVA to interval dependent variables, however, is a trade-off for greater statistical power.

⁶ This assumption is violated when ANOVA is used for ordinal variables that approximate the interval level of measurement, but this procedure is, strictly speaking, incorrect.
The χ² test could be substituted for ANOVA by transforming an interval dependent variable into categorical form. This approach is undesirable because valuable information is lost when a range of values is collapsed into a single category. Additional information is lost because the ordering of values is immaterial to χ². The attendant decrease in statistical power makes χ² an unattractive alternative to ANOVA.

The ANOVA procedure is concerned with both central tendency and spread. The measure of central tendency is the mean. It is calculated for the dependent variable, both overall and within groups defined by the categories of the independent variable. Specifically, the null hypothesis is that the within-group mean is equal across groups and, therefore, equal to the grand mean:

\[
H_0:\ \mu_1 = \mu_2 = \dots = \mu_j = \mu \tag{2}
\]

In this equation, µ is the mean, and j is the number of groups, which is the number of categories on the independent variable. The null hypothesis is rejected if any µ_j ≠ µ, that is, if any group mean differs from the grand mean.

The critical issue in ANOVA is not the absolute difference in means, however, but the difference in means relative to spread or the variance of the distribution.⁷ This feature is illustrated in Figure 3 for several hypothetical distributions. The first panel (a) displays large differences in means relative to the within-group variation, a pattern that yields a large value for F. This pattern would lead to a rejection of the null hypothesis.

The second panel (b) displays the same absolute mean differences as the top panel, but substantially larger within-group variation. Although the means differ from one another, there is a large amount of overlap among the distributions. The overlap is so extensive that the distribution with the lowest mean overlaps the distribution with the highest mean. This pattern would produce a low F value, leading to failure to reject the null hypothesis.
This conclusion is reached even though the absolute mean differences are the same as in panel a, which led to rejection of the null hypothesis. The difference in conclusions for the two sets of distributions arises from their spreads: the mean difference is large relative to the variance of the distribution in panel a, but relatively small in panel b.

The third panel (c) shows distributions with the same variances as the second panel (b), but with substantially larger mean differences. As in the previous case, the variances are large. In this instance, however, the mean differences between groups are also large. Similarly, in the last panel (d) the variances are small, but so are the mean differences between groups. We would fail to reject the null hypothesis despite the small variances because these variances are large relative to the mean differences. As these examples illustrate, it is not the absolute value of the mean differences that is crucial, but the mean difference relative to the variance.

⁷ ANOVA was originally developed in terms of a variance: specifically, the hypothesis that all of the group means are equal is equivalent to the hypothesis that the variance of the means is zero (Darlington 1974).
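The contrast between panels a and b can be reproduced numerically. The sketch below uses the standard one-way decomposition into between- and within-group sums of squares; the two data sets are hypothetical and are constructed to share the same group means while differing in within-group spread. The function name `one_way_f` is illustrative.

```python
def one_way_f(groups):
    """F ratio for a one-way ANOVA; groups is a list of lists of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-groups: group means around the grand mean, weighted by size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-groups: observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Same group means (10, 15, 20) in both hypothetical data sets.
small_spread = [[9, 10, 11], [14, 15, 16], [19, 20, 21]]   # like panel a
large_spread = [[0, 10, 20], [5, 15, 25], [10, 20, 30]]    # like panel b

f_small, df_b, df_w = one_way_f(small_spread)
f_large, _, _ = one_way_f(large_spread)
print(f"small spread: F = {f_small:.2f}; large spread: F = {f_large:.2f}")
```

The mean differences are identical in the two data sets, yet F is far larger when the within-group variation is small, which is the point the panels are meant to convey.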
ANOVA is based on the decomposition of variation in the dependent variable into within- and between-group components, with the groups being categories of the independent variable. The calculations are based on the sum of squares, that is, deviation of observations from the mean, as shown in Figure 4.⁸ The test statistic for ANOVA is F, which is a ratio of variances. If differences between the means are due to sampling error, then the F ratio should be around 1.00. Large values of F (relative to its degrees of freedom) would lead to rejection of the null hypothesis.

Although ANOVA may be used when there are only 2 categories on the independent variable, it is customary to use a t test in this situation. The definitional formula for t is:⁹

\[
t = \frac{M_1 - M_2}{\sqrt{(s_1^2/n_1) + (s_2^2/n_2)}} \tag{3}
\]

t and F are equivalent when there are only two groups: t² = F. Thus, the choice of one method over the other is immaterial.

ANOVA is not informative about which means are unequal: it tests only whether all means are equal to one another and to the grand mean. Rejection of the null hypothesis signifies that the dependent variable is probably not uniformly distributed across all values of the independent variable, but does not reveal which cells deviate from expectations. Thus, an association appears to be present, but it is not known whether it is the one forecast by theory. As was the case with the χ² procedure, additional steps are required to test more specific hypotheses about the precise nature of the association. This may be done a priori or using a post hoc test (e.g., Scheffé). Specifying contrasts in advance (on the basis of theory) is preferable to examining all possible contrasts after the fact because the latter penalizes you for making multiple contrasts to reduce the risk of capitalizing on chance.

Depending upon one's hypothesis, it also may be desirable to test for a trend.
Trendanalysis is most relevant when the independent variable is interval and linearity is at issue. Inthis case, it may be desirable to partition the between-groups sum of squares into linear,quadratic, cubic, or higher-order trends. However, visual inspection of the data is usually the8 There are numerous alternative specifications of the ANOVA model depending upon whether fixed or random effects are modeled, whether there are equal or unequal cell frequencies, etc. The user should consult a text on ANOVA for full discussion of these technical concerns.9 This formula for t assumes unequal variances between groups; a slightly different formula is used if one can assume equal variances.
18. 18. Bivariate 18most informative approach for understanding the shape of the association. Simply plotting meanvalues is often more instructive than the results of sophisticated statistical tests.
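The sum-of-squares decomposition and the two-group equivalence of t and F described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's computation: the two small samples are invented for the example, the function names are hypothetical, and the fixed-effects one-way model is assumed. The pooled-variance t is included because that is the form for which t² = F holds exactly.

```python
from math import sqrt
from statistics import mean, variance

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_obs = [y for g in groups for y in g]
    grand_mean = mean(all_obs)
    # Between-groups sum of squares: group means around the grand mean,
    # weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: observations around their own group mean.
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
    df_between = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

def t_unequal_variances(g1, g2):
    """t as in equation (3): no assumption of equal group variances."""
    se = sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))
    return (mean(g1) - mean(g2)) / se

def t_pooled(g1, g2):
    """Equal-variance (pooled) form of t, for which t squared equals F."""
    n1, n2 = len(g1), len(g2)
    pooled = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / sqrt(pooled * (1 / n1 + 1 / n2))

# Hypothetical symptom scores for two groups (illustration only).
group_a = [8, 11, 9, 12, 10]
group_b = [13, 15, 11, 14, 12]

f_ratio, df_b, df_w = one_way_anova([group_a, group_b])
t_eq = t_pooled(group_a, group_b)
t_uneq = t_unequal_variances(group_a, group_b)
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}; pooled t squared = {t_eq ** 2:.2f}")
```

In this example the groups have equal sizes and equal variances, so the two versions of t coincide; with unequal variances or group sizes they generally differ, and only the pooled form reproduces F exactly.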
Several core aspects of ANOVA are illustrated in Table 3, which shows group variation in levels of depressive symptoms. These data are from the survey of Toronto adults introduced earlier in this paper (Turner & Marino 1994; see Table 1). In addition to major depressive disorder, this study also assessed the occurrence of depressive symptoms during the previous week. This assessment was made with the Center for Epidemiologic Studies-Depression (CES-D) Scale, which is the summated total of 20 symptoms, each rated from (0) "0 days" through (3) "5-7 days."

Table 3. Depression by Select Characteristics

Characteristic           CES-D† (Mean)        N      MDD‡ (%)
Gender
  Male                     10.21***          603       7.7***
  Female                   13.10             788      12.9
Age
  18-25                    15.14***          304      18.4***
  26-35                    10.92             470       9.8
  36-45                    11.09             393       7.2
  46-55                     9.15             224       4.7
Marital Status
  Married                   9.98***          673       6.6***
  Previously Married       14.22             171      11.5
  Never Married            13.70             547      15.8
Total                      11.79           1,391      10.6

Source: Turner and Marino (1994); Table 1.
† Depressive symptoms; Center for Epidemiologic Studies-Depression Scale.
‡ Major Depressive Disorder; Composite International Diagnostic Interview.
*** p < .001

The average symptom level varies significantly by gender, age, and marital status. The nature of this difference is clear for gender insofar as there are only two groups: the average is lower for men than for women. For age and marital status, however, the differences are less clear because more than two groups are being compared. Symptoms are most common among the youngest adults and thereafter tend to decline with age (at least through age 55). It is tempting to conclude that the youngest and oldest groups differ, given that these scores are the most extreme. These large differences, however, may be offset by exceedingly large variances. As a result, we are limited to the nonspecific conclusion that depressive symptoms vary with age.

The two unmarried groups have similar levels of symptoms compared to the markedly different scores of the currently married. In the absence of specific tests for pairs of means, however, we can only conclude that at least one of these means differs from the grand mean.

The far right column of Table 3 presents prevalence estimates for major depressive disorder. These data are shown here to emphasize the similarity between the analysis of means (ANOVA) and the analysis of proportions (χ²). The prevalence of major depression differs significantly by gender, age, and marital status. The nature of the gender difference is again clear, given that there are only two groups. However, the nature of the age and marital status associations is not specified for particular subgroups. Thus, we are limited to the conclusion that depression is associated with age and cannot conclude that depression declines with age, even though the prevalence of depression in the youngest age group is almost twice that of any other age group. Similarly, although it is tempting to conclude that the married are substantially less likely to be depressed than the previously married or the never married, this information is not given by the overall test, meaning that we can only conclude that the three groups do not have the same rate of depression.

Finally, a comment on linearity. Figure 5 graphs the association between depression and age from the data in Table 3. In this example, age has been collapsed into four categories, meaning that it is ordinal rather than interval. In reality, however, age is an interval variable, making it reasonable to ask whether its association with depression is linear. The problem in treating age as a quasi-interval variable is the first interval, ages 18-25, which is shorter (8 years) than the other intervals (10 years).
The problem of unequal age intervals can be circumvented, however, by assigning each interval a value equal to its midpoint.

The observed age trend for depressive symptoms [Figure 5(a)] is distinctly nonlinear. Symptoms are most common among the youngest age group and least common in the oldest age group, but do not follow a pattern of steady decline between these extremes. Instead, there is a plateau in average symptom levels between the two middle age groups.

The observed age trend for rates of major depressive disorder [Figure 5(b)] also is distinctly nonlinear, although the pattern differs somewhat from the pattern for average symptom levels. Like symptoms, disorder is most common for the youngest adults and least common for the oldest adults. Unlike symptoms, however, the decline with age is apparent across the two middle age groups. Despite the continuity of decline, the trend is nonlinear because the decline during the youngest period is noticeably steeper than thereafter.

In sum, although ANOVA does not assume linear associations, it is possible to ascertain whether this is the case when both variables are interval (or quasi-interval). The same is true for the analysis of proportions using the χ² test. We turn now to the correlation coefficient, which assumes the linear form.
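The midpoint device and the nonlinearity of the two age trends can be checked directly from the Table 3 figures. The sketch below is illustrative only: it assumes each age category is scored at the simple average of its lower and upper bounds, and the helper name `change_per_year` is hypothetical. If a trend were linear, the change per year of age would be roughly the same between every pair of adjacent categories.

```python
# Age categories and group values taken from Table 3.
age_groups = [(18, 25), (26, 35), (36, 45), (46, 55)]
cesd_mean = [15.14, 10.92, 11.09, 9.15]   # mean CES-D score
mdd_pct = [18.4, 9.8, 7.2, 4.7]           # percent with major depressive disorder

# Assumption: score each category at the average of its bounds.
midpoints = [(low + high) / 2 for low, high in age_groups]

def change_per_year(x, y):
    """Change in y per one-year increase in x between adjacent categories."""
    return [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(x) - 1)]

cesd_slopes = change_per_year(midpoints, cesd_mean)
mdd_slopes = change_per_year(midpoints, mdd_pct)

for i, (s, m) in enumerate(zip(cesd_slopes, mdd_slopes)):
    print(f"{midpoints[i]} to {midpoints[i + 1]}: "
          f"CES-D {s:+.3f} per year, MDD {m:+.3f} points per year")
```

The symptom slopes change sign across the middle categories (the plateau), whereas the disorder slopes are all negative but much steeper at the youngest ages, which is the pattern described for Figure 5.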
Correlations: Linear Associations. Although there are several correlation coefficients, Pearson's r is by far the most widely used. This coefficient is used when both the independent and dependent variables are measured at the interval level.¹⁰ From the perspective of operationalizing a theory-based relationship, the most important aspect of this technique is the assumption that the association between the independent and dependent variables is linear.¹¹ It is conventional to recommend inspection of a scatterplot to ensure that there are no gross departures from linearity. This approach is illustrated in Figure 6 for both linear and nonlinear associations.

Although the shape of an association is usually clear in textbook illustrations such as this one, it is more difficult to visualize associations from scatterplots in practice. The difficulty arises because large sample sizes generate too many data points, many of which overlap. Computer-generated scatterplots use symbols such as letters to signify the number of observations at a particular location, but it is difficult to mentally weigh the points in a plot according to these symbolic tallies. It is sometimes useful to select a small random sample of one's sample to circumvent this problem.

Another tactic for detecting nonlinearity is to collapse the independent variable and examine the distribution of means as one would in ANOVA. This technique is not as informative as the scatterplot, given that many distinct values are collapsed into categories and means, but it is helpful in detecting departures from linearity, especially in combination with a scatterplot. Yet another strategy entails collapsing both variables into categorical form and examining their cross-tabulation. This procedure sacrifices even more information than the previous approach, but it may be helpful, especially if extreme scores are of special interest.
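Two of the tactics just described, drawing a small random subsample for plotting and collapsing the independent variable to inspect category means, might be sketched as follows. The data are simulated with a deliberately U-shaped association, and the function name `binned_means`, the number of categories, and the subsample size are all assumptions made for the illustration.

```python
import random
from statistics import mean

def binned_means(x, y, low, high, k):
    """Collapse x into k equal-width categories and return the mean of y in each."""
    width = (high - low) / k
    bins = [[] for _ in range(k)]
    for xi, yi in zip(x, y):
        # Values at the upper bound are placed in the last category.
        i = min(int((xi - low) // width), k - 1)
        bins[i].append(yi)
    return [mean(b) if b else None for b in bins]

random.seed(0)  # fixed seed so the illustration is reproducible

# Simulated data: a U-shaped (parabolic) association plus random noise.
x = [random.uniform(0, 10) for _ in range(2000)]
y = [(xi - 5) ** 2 + random.gauss(0, 2) for xi in x]

# Tactic 1: a small random subsample is easier to read in a scatterplot.
subsample = random.sample(list(zip(x, y)), 100)

# Tactic 2: means of y across collapsed categories of x, as in ANOVA.
means = binned_means(x, y, low=0, high=10, k=5)
print([round(m, 1) for m in means])
```

The category means fall toward the middle of the range of x and rise again, a departure from linearity that would be hard to see in a plot of all 2,000 overlapping points.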
The correlation coefficient r describes the association between two variables as the straight line that minimizes the squared deviations between observed (Y) and estimated (Ŷ) values of the dependent variable, as illustrated in Figure 7. This feature gives the method its name, the "least-squares method." The value of r measures the association between X and Y in terms of how closely the data points cluster around the least-squares line. The absolute value of r is large when the data points hover close to the least-squares line; when observations are more widely dispersed around this line, the absolute value of r is close to zero. The values of the correlation coefficient range from 1, which indicates perfect correspondence, through 0, which signifies a complete lack of linear correspondence, to -1, which connotes perfect inverse (i.e., negative) association (see Figure 6).

¹⁰ As we have seen repeatedly, however, ordinal variables that approximate the interval level of measurement are often used in practice for statistical techniques that require an interval level of measurement.
¹¹ There are other requirements as well, including normal distributions and homoscedasticity. The reader is referred to a text on multiple regression for a thorough consideration of the correlational model and its assumptions.
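As a rough illustration of how r and the least-squares line are obtained from sums of squares and cross-products, consider the following sketch. The data are invented, and the function names are hypothetical; the formulas are the standard definitional ones rather than anything specific to Figure 7.

```python
from math import sqrt

def sums_of_squares(x, y):
    """Return (Sxx, Syy, Sxy): squared and cross-product deviations from the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxx, syy, sxy

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    sxx, syy, sxy = sums_of_squares(x, y)
    return sxy / sqrt(sxx * syy)

def least_squares_line(x, y):
    """Slope and intercept of the line minimizing squared deviations in Y."""
    sxx, _, sxy = sums_of_squares(x, y)
    slope = sxy / sxx
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)
    return slope, intercept

# Hypothetical data (illustration only).
x = [1, 2, 3, 4, 5]
y_perfect = [2, 4, 6, 8, 10]    # every point on a rising line
y_inverse = [10, 8, 6, 4, 2]    # every point on a falling line
y_scattered = [2, 1, 4, 3, 5]   # points dispersed around a rising line

r_perfect = pearson_r(x, y_perfect)
r_inverse = pearson_r(x, y_inverse)
r_scattered = pearson_r(x, y_scattered)
slope, intercept = least_squares_line(x, y_scattered)
print(r_perfect, r_inverse, round(r_scattered, 2), round(slope, 2), round(intercept, 2))
```

The first two series reproduce the limiting values of 1 and -1, while the scattered series yields an intermediate value because its points lie around, rather than on, the least-squares line.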
The null hypothesis is once again that Y is independent of X, specifically, H0: r = 0. The test statistic for generalization from the sample to the population is t, computed as shown in Figure 7. Although a two-tailed test may be used, the direction of the association is usually theoretically important, which makes the use of a one-tailed test appropriate.

It is important to note that this technique assumes that the association between X and Y is linear in form. If there is a nonlinear association, the value of r will be seriously misleading. This problem is illustrated in Figure 8. In this example, the data are better represented by a parabola than by a straight line. Correlational techniques for determining whether specific nonlinear trends are present entail a multivariate model. Here it suffices to note that these techniques are isomorphic to those described above for trends in ANOVA.

The slope of the least-squares line is of interest because it quantifies the magnitude of the association between the independent and dependent variables. Specifically, the slope is the change in Y associated with a one-unit increase in X. A steep slope (in either direction) indicates a strong relationship, whereas a weak relationship appears as an almost horizontal line. The slope is not given by r, a common misconception, but can be derived from r and information about the distributions of X and Y. Another indicator of the strength of the association is r², which is the proportion of the variance in the dependent variable that is accounted for by the independent variable.

For example, two measures of depression strongly covary: the correlation between the Child Depression Inventory (CDI)¹² and the Stony Brook (SB) Child Psychiatric Checklist¹³ measure of depression is quite strong (r = .59; p < .001), but well below the perfect correspondence (r = 1.00) that would be expected if both measures were perfectly reliable and valid.
The r² value is .35, meaning that about a third of the variance in the Stony Brook measure is shared with the CDI (and vice versa). Although this correspondence is strong, most of the variance in the two measures is not shared. The significance test for the correlation between the CDI and the SB (p < .001) indicates that it is extremely unlikely that this correlation would have been observed if in truth the two variables were not correlated with one another.

¹² Kovacs, M. & Beck, A.T. (1977). An empirical-clinical approach toward a definition of childhood depression. In J.G. Schulterbrandt & A. Raskin (Eds.), Depression in Childhood: Diagnosis, Treatment, and Conceptual Models. NY: Raven, pp. 1-25.
¹³ Gadow, K.D. and Sprafkin, J. (1987). Stony Brook Child Psychiatric Checklist-3R. Stony Brook, New York. Unpublished manuscript.
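Several points in this discussion of the correlation coefficient lend themselves to a short numerical check. The sketch below is offered under stated assumptions: Figure 7 is not reproduced here, so the t statistic uses the standard formula t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom, which may or may not match the figure's notation; the data series and the sample size passed to `t_for_r` are invented; and the function names are hypothetical. Only r = .59 is taken from the text.

```python
from math import sqrt
from statistics import stdev

def pearson_r(x, y):
    """Pearson correlation coefficient from deviations around the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_for_r(r, n):
    """Standard t statistic for H0: r = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# 1. A perfectly parabolic association can produce r = 0.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_parabola = [v ** 2 for v in x_sym]
r_parabola = pearson_r(x_sym, y_parabola)

# 2. The slope is not r, but follows from r and the two standard deviations.
x = [1, 2, 3, 4, 5]
y = [4, 2, 8, 6, 10]            # hypothetical scores
r = pearson_r(x, y)
slope_from_r = r * stdev(y) / stdev(x)

# 3. Shared variance for the reported CDI-SB correlation.
r_cdi_sb = 0.59
shared_variance = r_cdi_sb ** 2

# 4. The t statistic depends on sample size; n = 27 is hypothetical.
t_example = t_for_r(0.5, 27)

print(round(r_parabola, 2), round(r, 2), round(slope_from_r, 2),
      round(shared_variance, 2), round(t_example, 2))
```

The parabola case shows why r alone can be seriously misleading: Y is completely determined by X, yet the linear correlation is zero.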
The correlation coefficient is an appropriate indicator of the association between the CDI and the SB for two reasons: (1) both variables are quasi-interval, and (2) the association can be assumed to be linear. The latter point is particularly important, given that the functional form of associations is often overlooked. The two variables are similar measures of the same construct, which means that an increase in one measure should be matched by an increase in the other measure. Moreover, this correspondence should be evident across the full span of values for both variables. There is no reason to anticipate, for example, a plateau or a threshold effect. Thus, the correlation coefficient is an appropriate choice.
The correlation between the CDI and the SB is considerably stronger than their correlations with measures of other constructs. This pattern is expected, given that two measures of the same construct should be more highly correlated than measures of different constructs.

However, adolescent depression was also assessed with two other measures, SB ratings made by the mother and by the father. These measures correlate with the CDI (.29 and .22, respectively), and with the adolescent's self-assessment on the SB (.28 and .21, respectively), but far below the correlation between the two adolescent measures (.59). Although these correlations are all statistically significant, the parental measures account for no more than 8.4 percent of the variance in the adolescent self-reports. Thus, as concluded earlier, parental ratings are not especially good measures of adolescent mood, despite the fact that such ratings have been standard practice in psychiatric epidemiology.

In sum, the few correlations reviewed here demonstrate the importance of considering both the statistical significance of an association and its substantive importance.

SUMMARY

Although the specifics of bivariate analysis are unique to each statistical technique, there is a functional similarity across these methods. As noted above, the null hypothesis for bivariate analysis states that the values on the two variables are independent of one another. The usual goal is to reject this hypothesis, to conclude that the variables are not independent of one another. In practice, this means that knowing the values on one variable is informative about the likely values on the second variable.