3.
Questions
• Why does the maximum
value of r equal 1.0?
• What does it mean when a
correlation is positive?
Negative?
• What is the purpose of the
Fisher r to z transformation?
• What is range restriction?
Range enhancement? What
do they do to r?
• Give an example in which
data properly analyzed by
ANOVA cannot be used to
infer causality.
• Why do we care about the
sampling distribution of the
correlation coefficient?
• What is the effect of reliability
on r?
4.
Basic Ideas
• Nominal vs. continuous IV
• Degree (direction) & closeness (magnitude) of linear relations
Sign (+ or −) for direction
Absolute value for magnitude
• Pearson product-moment correlation coefficient
$$r = \frac{\sum z_X z_Y}{N}$$
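As a minimal sketch of the formula for r, the following computes the Pearson correlation as the average cross product of z-scores. The function names and the height/weight numbers are hypothetical, invented for illustration; the z-scores use the population standard deviation (dividing by N), which is what makes the average of the products equal r.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert raw scores to z-scores using the population SD (divide by N)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson_r(x, y):
    """Pearson r as the average cross product of paired z-scores."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Hypothetical data, not taken from the slides.
height = [60, 63, 66, 69, 72, 75]
weight = [102, 120, 130, 150, 170, 200]
r = pearson_r(height, weight)
print(round(r, 3))
```

When every point falls exactly on a rising straight line, the two sets of z-scores are identical and the average product reaches its maximum of 1.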
5.
Illustrations
[Figure: three scatterplots, in order: Plot of Weight by Height; Plot of Errors by Study Time; Plot of SATV by Toe Size.]
Positive, negative, zero
6.
Graphic Representation
[Figure: Plot of Weight by Height in raw scores (Mean = 66.8 inches, Mean = 150.7 lbs.) and Plot of Weight by Height in Z-scores (Zweight by Zheight), with the four quadrants marked + or − by the sign of the cross product.]
1. Conversion from raw to z.
2. Points & quadrants. Positive & negative products.
3. Correlation is average of cross products. Sign & magnitude of r
depend on where the points fall.
4. Product at maximum (average = 1) when points fall on the line where z_X = z_Y.
7.
Correlation Analysis
• It measures the closeness of the relationship between two or more variables
• The degree of association or covariation between variables, no causality
• Measures of Association by Measurement
• Interpretation of Correlation
t-test
9.
Questions
• What are predictors and criteria?
• Write an equation for the linear
regression. Describe each term.
• How do changes in the slope and
intercept affect (move) the
regression line?
• What does it mean to test the
significance of the regression sum
of squares? R-square?
• What is R-square?
• What does it mean to choose a
regression line to satisfy the loss
function of least squares?
• How do we find the slope and
intercept for the regression line with
a single independent variable?
(Either formula for the slope is
acceptable.)
• Why does testing for the regression
sum of squares turn out to have the
same result as testing for R-square?
10.
Basic Ideas
• Jargon
IV = X = Predictor (pl. predictors)
DV = Y = Criterion (pl. criteria)
Regression of Y on X e.g., GPA on SAT
• Linear Model = relations between IV and DV represented by straight line.
• A score on Y has 2 parts – (1) linear function of X and (2) error.
$Y_i = \alpha + \beta X_i + \varepsilon_i$ (population values)
11.
Regression Analysis
• It refers to the techniques used to derive an equation that relates the
criterion variable to one or more predictor variables
• Method of least squares
• Standardized coefficients
• Goodness of fit
F test, t test, Coefficient of Determination
• Multicollinearity
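The method of least squares and the goodness-of-fit measures listed above can be sketched for the single-predictor case. This is an illustrative sketch only: the function names and the SAT/GPA numbers are hypothetical, and the F ratio shown is the usual test of R-square with k predictors and N observations.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def r_square(x, y, slope, intercept):
    """Coefficient of determination: 1 - SS(residual) / SS(total)."""
    my = mean(y)
    ss_total = sum((b - my) ** 2 for b in y)
    ss_resid = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return 1 - ss_resid / ss_total


def f_statistic(r2, n, k=1):
    """F ratio for testing R-square, with k and n - k - 1 degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical data: regression of GPA on SAT.
sat = [400, 450, 500, 550, 600, 650]
gpa = [2.0, 2.4, 2.5, 3.0, 3.1, 3.6]
b, a = fit_line(sat, gpa)
r2 = r_square(sat, gpa, b, a)
f = f_statistic(r2, n=len(sat))
print(b, a, r2, f)
```

With one predictor, the F ratio for R-square is the square of the t statistic for the slope, so the two tests agree.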
15.
Raw & Standardized Regression Weights
• Each X has a raw score slope, b.
• Slope tells expected change in Y if
X changes 1 unit*.
• Large b weights should indicate
important variables, but b depends
on variance of X.
• A b for height in feet would be
12 times larger than b for height in
inches.
• If we standardize X and Y, all units
of X are the same.
• Relative size of b now meaningful.
*strictly speaking, holding other X variables constant.
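A short sketch can make the unit dependence concrete. A one-foot change in height is twelve one-inch changes, so the raw slope per foot is 12 times the raw slope per inch, while the standardized slope is the same in either unit. The helper names and data below are hypothetical.

```python
from statistics import mean, pstdev


def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def standardize(values):
    """Convert raw scores to z-scores."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data: the same heights expressed in inches and in feet.
height_in = [60, 63, 66, 69, 72, 75]
height_ft = [h / 12 for h in height_in]
weight_lb = [102, 120, 130, 150, 170, 200]

b_in = slope(height_in, weight_lb)  # pounds per inch
b_ft = slope(height_ft, weight_lb)  # pounds per foot
beta_in = slope(standardize(height_in), standardize(weight_lb))
beta_ft = slope(standardize(height_ft), standardize(weight_lb))
print(b_in, b_ft, beta_in, beta_ft)
```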
16.
Tests of R² vs. Tests of b
• Slopes (b) tell about the relation between Y and the unique part of X. R² tells about the proportion of variance in Y accounted for by the set of predictors all together.
• Correlations among X variables increase the standard errors of b weights but not R².
• Possible to get significant R², but no or few significant b weights.
• Possible but unlikely to have significant b but not significant R². Look to R² first. If it is n.s., avoid interpreting b weights.
17.
Testing Incremental R²
You can start regression with a set of one or more variables and then add predictors one or more at a time. When you add predictors, R² will never go down. It usually goes up, and you can test whether the increment in R² is significant or likely due to chance.

$$F = \frac{(R_L^2 - R_S^2)/(k_L - k_S)}{(1 - R_L^2)/(N - k_L - 1)}$$

$R_L^2$ = R-square for the larger model
$R_S^2$ = R-square for the smaller model
$k_L$ = number of predictors in the larger model
$k_S$ = number of predictors in the smaller model
$N$ = number of observations
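The F ratio for the increment can be computed directly from the two R-square values. The sketch below is illustrative: the function name and the example numbers are hypothetical, and N is taken to be the number of observations.

```python
def incremental_f(r2_large, r2_small, k_large, k_small, n):
    """F test for the increment in R-square between nested models.

    F = ((R2_L - R2_S) / (k_L - k_S)) / ((1 - R2_L) / (N - k_L - 1))
    Returns F with its numerator and denominator degrees of freedom.
    """
    df1 = k_large - k_small
    df2 = n - k_large - 1
    f = ((r2_large - r2_small) / df1) / ((1 - r2_large) / df2)
    return f, df1, df2


# Hypothetical example: adding one predictor raises R-square from .40 to .50.
f, df1, df2 = incremental_f(0.50, 0.40, k_large=3, k_small=2, n=104)
print(f, df1, df2)
```

In this made-up case F is about 20 on 1 and 100 degrees of freedom, which would then be compared with the F table.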
18.
(cont.)
• In regression problems, the most commonly used indices of importance are the correlation, r, and the increment to R-square when the variable of interest is considered last. The second is sometimes called a last-in R-square change. The last-in increment corresponds to the Type III sums of squares and is closely related to the b weight.
• The correlation tells about the importance of the variable ignoring all other predictors.
• The last-in increment tells about the importance of the variable as a unique contributor to the prediction of Y, above and beyond all other predictors in the model.
• “Importance” is not well defined statistically when IVs are correlated. Doesn’t include mediated models (path analysis).
19.
Collinearity Defined
• The problem of large correlations among the independent variables
• Within the set of IVs, one or more IVs are (nearly) totally predicted by the
other IVs.
• In such a case, the b or beta weights are poorly estimated.
• Problem of the “Bouncing Betas.”
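One way to see why the weights are poorly estimated is the variance inflation factor, which is not named in these slides and is brought in here as an illustration. For two predictors correlated r12, the sampling variance of each b weight is multiplied by 1 / (1 - r12²) relative to uncorrelated predictors. The function name below is hypothetical.

```python
def variance_inflation(r12):
    """Variance inflation for a b weight when two predictors correlate r12."""
    return 1.0 / (1.0 - r12 ** 2)


# Inflation grows without bound as the predictors approach perfect correlation.
for r12 in (0.0, 0.5, 0.9, 0.99):
    print(r12, round(variance_inflation(r12), 2))
```

As r12 approaches 1 the inflation becomes very large, so small changes in the sample can move the estimates a great deal, which is the "bouncing" behavior.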
20.
Dealing with Collinearity
• Lump it. Admit ambiguity; SE of b weights. Refer also to correlations.
• Select or combine variables.
• Factor analyze set of IVs.
• Use another type of analysis (e.g., path analysis).
• Use another type of regression (ridge regression).
• Unit weights (no longer regression).
21.
Diagnostics
Checking Assumptions and Bad Data
22.
Good-Looking Graph
[Figure: scatterplot of Y by X.]
No apparent departures from line.
23.
Problem with Linearity
[Figure: scatterplot of Miles per Gallon by Horsepower; R Sq Linear = 0.595.]
24.
Outliers
[Figure: scatterplot of Y by X with one point labeled "Outlier".]
Outlier = pathological point
26.
Nonparametric or Distribution-free Tests
• Two kinds of assertions in statistical tests: (1) an assertion directly related to the purpose of the investigation, i.e., the hypothesis to be tested; (2) assertions needed to make a probability statement. The set of all assertions is called the model.
• Testing a hypothesis without a model is a nonparametric test. That is, a test that makes no basic assumptions about, and requires no knowledge of, the distribution of the population or its parameters.
27.
Characteristics
1. Do not depend on assumptions about the properties or parameters of the parent population, i.e., do not suppose any particular distribution and its consequential assumptions. (Parametric tests such as the t and F tests assume homogeneity of variances.) Nonparametric tests make no such assumptions, or less restrictive ones.
2. When measurements are not very accurate, nonparametric tests come in very handy.
3. Most nonparametric tests assume only nominal or ordinal data, i.e., they are more suitable than parametric tests for nominal and ordinal (or rated) data.
4. Involve few arithmetic computations.
28.
(cont.)
5. Usually less efficient and less powerful than parametric tests, as they are based on few or no assumptions.
6. Greater risk of accepting a false hypothesis, i.e., committing a type II error; nonparametric tests require more observations than parametric tests to achieve the same size of type I and type II errors.
7. The null hypothesis is somewhat loosely defined, so rejecting it may lead to a less precise conclusion than with parametric tests.
8. There is a trade-off between a loss in sharpness of interval estimates and a gain in the ability to use less information and to calculate faster.
29.
Some important applications are
(I) concerning a single value for the given data
(II) difference among 2 or more sets of data
(III) relations between variables
(IV) variation in the given data
(V) randomness of a sample
(VI) association or dependency of categorical data
(VII) comparing a theoretical population with actual data in categories
30.
Typical situation
1. Data not likely to be normally distributed
2. Nominal data from responses to questionnaires
3. Partially answered questions, i.e., handling incomplete or missing data and making the necessary adjustments to extract maximum information from the available data
4. Reasonably good results from even very small samples, but more observations are needed than with parametric tests to achieve the same size of type I and type II errors
32.
McNemar Test
• Useful for testing nominal data from two related samples, such as before-after measurements on the same subjects, with a view to judging the significance of any observed change after treatment
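A minimal sketch of the McNemar statistic follows. It uses only the two discordant cells of the before-after table: b (subjects who changed in one direction) and c (subjects who changed in the other). The function name, the optional continuity correction, and the counts are assumptions made for illustration; the statistic is referred to the chi-square distribution with 1 degree of freedom.

```python
def mcnemar_chi2(b, c, correction=True):
    """McNemar chi-square from the two discordant cell counts.

    Without correction: (b - c)^2 / (b + c)
    With continuity correction: (|b - c| - 1)^2 / (b + c)
    """
    diff = abs(b - c)
    if correction:
        diff = max(diff - 1, 0)
    return diff ** 2 / (b + c)


# Hypothetical counts: 5 subjects changed from favor to not favor,
# 15 changed from not favor to favor.
chi2_corrected = mcnemar_chi2(5, 15)
chi2_plain = mcnemar_chi2(5, 15, correction=False)
print(chi2_corrected, chi2_plain)
```

Both values exceed 3.841, the .05 critical value of chi-square with 1 degree of freedom, so in this made-up case the change would be judged significant.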
33.
Chi-Square Test
• An important nonparametric test for significance of association, as well as for testing hypotheses regarding (i) goodness of fit and (ii) homogeneity or significance of population variance
• Used when responses are classified into two mutually exclusive classes, such as favor / not favor or like / dislike
• To find whether differences exist between observed and expected data
• χ² is not a measure of the degree of relationship
• Assumes random observations
• Items in the sample are independent
34.
(cont.)
• Constraints are linear, no cell has a frequency below five, and the overall number of items must be reasonably large. (Yates' correction can be applied to a 2x2 table if cell frequencies are smaller than five.) Use Kolmogorov-Smirnov Test.
• Phi coefficient, $\varphi = \sqrt{\chi^2 / N}$, a nonparametric measure of correlation, helps to estimate the magnitude of association.
• Cramér's V measure, $V = \sqrt{\varphi^2 / \min(r - 1,\, c - 1)}$, where r and c are the numbers of rows and columns.
• Coefficient of contingency, $C = \sqrt{\chi^2 / (\chi^2 + N)}$, also known as the coefficient of mean square contingency, is a nonparametric measure of relationship useful where contingency tables are of higher order than 2x2 and combining classes is not possible for Yule's coefficient of association.
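The chi-square statistic and the three measures of association built on it can be sketched as follows. The function names and the 2x2 table are hypothetical; expected frequencies are computed as row total times column total divided by N, and no Yates' correction is applied.

```python
from math import sqrt


def chi_square(table):
    """Chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def association_measures(table):
    """Phi, Cramer's V, and the contingency coefficient C."""
    chi2, n = chi_square(table)
    rows, cols = len(table), len(table[0])
    phi = sqrt(chi2 / n)
    cramers_v = sqrt(chi2 / (n * min(rows - 1, cols - 1)))
    contingency_c = sqrt(chi2 / (chi2 + n))
    return phi, cramers_v, contingency_c


# Hypothetical 2x2 table of observed frequencies.
observed = [[30, 10], [20, 40]]
chi2, n = chi_square(observed)
phi, v, c = association_measures(observed)
print(chi2, phi, v, c)
```

For a 2x2 table, min(r - 1, c - 1) is 1, so phi and Cramér's V coincide.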
35.
Wilcoxon-Mann-Whitney U Test
• Among the most powerful nonparametric tests for determining whether two independent samples have been drawn from the same population. Used as an alternative to the t-test for both qualitative and quantitative data
• Both samples are pooled together and the elements ranked in ascending order to find U
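The pooling and ranking step can be sketched as below. This is an illustrative sketch, not a full test: the function names and data are hypothetical, tied values receive their average rank, and the resulting U would still have to be compared with a table of critical values or a normal approximation.

```python
def average_ranks(values):
    """Ascending ranks, with tied values sharing their average rank."""
    ranks = []
    for v in values:
        below = sum(1 for w in values if w < v)
        ties = sum(1 for w in values if w == v)
        ranks.append(below + (ties + 1) / 2)
    return ranks


def mann_whitney_u(sample1, sample2):
    """Return (U, U1, U2), where U is the smaller of U1 and U2."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))  # pool, then rank
    r1 = sum(ranks[:n1])  # rank sum of the first sample
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 - u1
    return min(u1, u2), u1, u2


# Hypothetical scores for two independent groups.
group_a = [12, 15, 11, 18, 14]
group_b = [20, 17, 22, 19, 16]
u, u1, u2 = mann_whitney_u(group_a, group_b)
print(u, u1, u2)
```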
36.
Wilcoxon Matched Pair or Signed Rank Test
• Used in the context of two related samples where we can determine both the direction and the magnitude of the difference. Examples: wife and husband, subjects studied before and after an experiment, comparing the output of two machines, etc.
• As it attaches greater weight to a pair that shows a larger difference, it is a more powerful test than the sign test
• Null hypothesis (H0) is that there is no difference between the two groups with respect to the characteristic under study
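A sketch of the signed-rank computation: differences are ranked by absolute size, the ranks are given the sign of their differences, and the smaller of the two rank sums is the test statistic. The function names and data are hypothetical, zero differences are dropped, and ties share their average rank.

```python
def average_ranks(values):
    """Ascending ranks, with tied values sharing their average rank."""
    ranks = []
    for v in values:
        below = sum(1 for w in values if w < v)
        ties = sum(1 for w in values if w == v)
        ranks.append(below + (ties + 1) / 2)
    return ranks


def wilcoxon_signed_rank(before, after):
    """Return (T, T_plus, T_minus), where T is the smaller rank sum."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(t_plus, t_minus), t_plus, t_minus


# Hypothetical before/after scores for six subjects.
before = [10, 12, 9, 14, 11, 13]
after = [12, 15, 8, 18, 16, 19]
t, t_plus, t_minus = wilcoxon_signed_rank(before, after)
print(t, t_plus, t_minus)
```

Because larger differences get larger ranks, a single large change counts for more here than in the sign test, which only records direction.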
37.
K Sample (i.e., more than two sample) Tests
The Kruskal-Wallis Test or H Test:
• Similar to the U test
• H0: the K independent random samples come from identical universes. Does not require the assumption of a normal distribution, as H approximately follows the chi-square distribution; use the chi-square table.
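The H statistic can be sketched as follows, using H = 12 / (N(N + 1)) × Σ(R_j² / n_j) − 3(N + 1), where R_j is the rank sum and n_j the size of sample j. The function names and data are hypothetical, and no correction for ties is applied. H is compared with chi-square on K − 1 degrees of freedom.

```python
def average_ranks(values):
    """Ascending ranks, with tied values sharing their average rank."""
    ranks = []
    for v in values:
        below = sum(1 for w in values if w < v)
        ties = sum(1 for w in values if w == v)
        ranks.append(below + (ties + 1) / 2)
    return ranks


def kruskal_wallis_h(*samples):
    """Kruskal-Wallis H for K independent samples (no tie correction)."""
    pooled = [v for s in samples for v in s]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for s in samples:
        rank_sum = sum(ranks[start:start + len(s)])
        total += rank_sum ** 2 / len(s)
        start += len(s)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical scores for three independent groups.
h = kruskal_wallis_h([27, 31, 29], [35, 38, 40], [44, 47, 50])
print(h)
```

With samples this small the chi-square approximation is rough, so an exact table or a permutation test would be the safer reference.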
38.
A few points on KW
• Calculation of P-values: (avoiding type I errors)
– F statistic: F distribution (requires normality)
– KW statistic: χ² distribution (requires large samples)
– Either statistic: Permutation tests
• Power: (avoiding type II errors)
– KW statistic more resistant to outliers
– F statistic more powerful in the case of normality
• KW statistic: don’t need to worry about transformations
40.
Multivariate Analysis
• Discriminant Analysis
It joins a nominally scaled criterion or dependent variable with one or
more independent variables that are interval or ratio scaled.
• Multivariate ANOVA
Assesses the relationship between two or more dependent variables
and classificatory variables or factors
• LISREL (Linear Structural Relationships)
Measurement and Structural equation model
Causality testing
41.
Interdependency Techniques
• Factor analysis
A factor is a linear combination of variables
Constructs a new set of variables based on the
relationships in the correlation matrix
Factor loading
Orthogonal or oblique rotation
• Cluster Analysis
A set of techniques for grouping similar objects or people