1. Introduction to applied statistics
& applied statistical methods
Prof. Dr. Chang Zhu
Overview
• Chi-square test
• Discriminant analysis
• Logistic regression
• Nominal data/categorical data
• Dichotomous variable
Only 2 values, e.g. yes or no, male or
female
• Binary variable
Assign a 1 (yes) or 0 (no) to indicate
presence or absence of something
Chi-square analysis
• Level of measurement is nominal
• The chi square test is non-parametric. It can
be used when normality is not assumed.
3. Chi-square analysis
Association between categorical variables
• Suppose both response and explanatory variables
are categorical.
• There is association if the population conditional
distribution for the response variable differs among
the categories of the explanatory variable
Example: Contingency table on happiness cross-classified
by family income (data from 2006 GSS)
Chi-square analysis
                      Happiness
Income     Very         Pretty       Not too      Total
---------------------------------------------------------
Above      272 (44%)    294 (48%)     49 (8%)      615
Average    454 (32%)    835 (59%)    131 (9%)     1420
Below      185 (20%)    527 (57%)    208 (23%)     920
---------------------------------------------------------
Response: Happiness, Explanatory: Income
Relationship between income and happiness?
4. Chi-Squared Test of Independence
(Karl Pearson, 1900)
• Tests H0: The variables are statistically independent
• Ha: The variables are statistically dependent
• Intuition behind test statistic: Summarize differences
between observed cell counts and expected cell
counts (what is expected if H0 true)
• Notation: fo = observed frequency (cell count)
fe = expected frequency
r = number of rows in table, c = number of columns
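The comparison of observed and expected counts can be sketched in Python for the happiness-by-income table. This is a minimal illustration, not part of the original slides. It assumes the standard Pearson formulas fe = (row total × column total) / N, χ² = Σ (fo − fe)² / fe and df = (r − 1)(c − 1); the names `observed` and `chi_square` are hypothetical.

```python
# Observed counts (fo) from the happiness-by-income table (2006 GSS).
# Rows: above-average, average, below-average income.
# Columns: very happy, pretty happy, not too happy.
observed = [
    [272, 294, 49],
    [454, 835, 131],
    [185, 527, 208],
]


def chi_square(table):
    """Return (chi2, df, expected) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under H0 (independence): row total * column total / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(table, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


chi2, df, expected = chi_square(observed)
print(f"chi-square = {chi2:.1f}, df = {df}")
# The tabled critical value for df = 4 at alpha = .05 is 9.488.
print("reject H0 at alpha = .05:", chi2 > 9.488)
```

A statistic far above the critical value indicates that the observed counts are unlikely under independence.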
Chi-square analysis
• The chi-squared test answers “Is there an association?”
• Standardized residuals answer “How do data
differ from what independence predicts?”
• A measure of the strength of association, such as the
difference of proportions, answers “How strong is the
association?”
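The last two questions can be illustrated with the happiness-by-income counts. The sketch below is an assumption-laden example rather than the slides' own procedure: it uses the usual standardized (adjusted) residual, (fo − fe) / sqrt(fe × (1 − row proportion) × (1 − column proportion)), and the function name `standardized_residuals` is hypothetical. Residuals larger than about 3 in absolute value are usually read as cells that clearly depart from independence.

```python
import math

# Observed counts: rows = above / average / below income,
# columns = very / pretty / not too happy.
observed = [
    [272, 294, 49],
    [454, 835, 131],
    [185, 527, 208],
]


def standardized_residuals(table):
    """Standardized residual for every cell of an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out = []
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            se = math.sqrt(
                fe * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            out.append((fo - fe) / se)
        result.append(out)
    return result


residuals = standardized_residuals(observed)
for label, row in zip(("Above", "Average", "Below"), residuals):
    print(label, [round(z, 1) for z in row])

# Difference of proportions "very happy":
# above-average income group minus below-average income group.
diff = observed[0][0] / sum(observed[0]) - observed[2][0] / sum(observed[2])
print(f"difference of proportions = {diff:.2f}")
```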
5. Chi-square analysis
• Like all tests of hypothesis, chi square is
sensitive to sample size.
– For a fixed pattern of cell proportions, the obtained
chi square increases in proportion to N.
– With large samples, trivial relationships may be
significant. As a rule of thumb, when N > 1000, set
your alpha = .01.
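The effect of sample size can be seen with a small made-up example. The 2 × 2 counts below are hypothetical (they do not come from the slides): the same weak pattern, 52% versus 48%, is tested once with N = 200 and once with N = 4000.

```python
def chi_square(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (fo - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for fo, c in zip(row, col_totals)
    )


# Hypothetical counts with a trivial association (52% vs 48%).
small = [[52, 48], [48, 52]]                          # N = 200
large = [[cell * 20 for cell in row] for row in small]  # N = 4000

chi2_small = chi_square(small)
chi2_large = chi_square(large)
print(f"N = 200:  chi-square = {chi2_small:.2f}")
print(f"N = 4000: chi-square = {chi2_large:.2f}")

# Tabled critical values for df = 1: 3.841 (alpha = .05), 6.635 (alpha = .01)
print("significant at .05 with N = 4000:", chi2_large > 3.841)
print("significant at .01 with N = 4000:", chi2_large > 6.635)
```

Multiplying every cell by 20 multiplies the statistic by 20, so the same weak pattern becomes significant at .05 in the large sample but not at the stricter .01 level.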
Practice 1
• CHI-SQUARE TEST (CROSS-TAB)
• A group of students was classified in terms of
personality (introvert or extrovert) and in
terms of colour preference (red, yellow, green
or blue). Personality and colour preference are
categorical variables. We want to answer
this question:
• Is there an association between personality and
colour preference?
6. Practice 1
• In SPSS, Analyze > Descriptive Statistics >
Crosstabs
Practice 1 (output)
Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             71.200a   3    .000
Likelihood Ratio               70.066    3    .000
Linear-by-Linear Association   69.124    1    .000
N of Valid Cases               400
a. 0 cells (0.0%) have expected count less than 5. The
minimum expected count is 10.00.
There is a significant association between students’ personality and
colour preference: χ² (3, N = 400) = 71.20, p < .001.
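SPSS prints a significance of .000 because the p-value is rounded to three decimals, not because it is exactly zero. As a rough check, the upper-tail probability can be computed directly. The sketch below relies on the closed form that holds for df = 3 only, P(X > x) = erfc(sqrt(x/2)) + sqrt(2x/π) × exp(−x/2); the function name `chi2_sf_df3` is hypothetical.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 df."""
    return (
        math.erfc(math.sqrt(x / 2))
        + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    )


# Pearson chi-square reported in the SPSS output (df = 3).
p = chi2_sf_df3(71.20)
print(f"p = {p:.2e}")
print("p < .001:", p < 0.001)

# Sanity check: 7.815 is the tabled critical value for df = 3 at alpha = .05.
print(f"tail area beyond 7.815 = {chi2_sf_df3(7.815):.3f}")
```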
7. Discriminant analysis
• Similar to regression, except that the criterion (or
dependent variable) is categorical rather than
continuous.
• Used to identify boundaries between groups of
objects
For example: (a) Does a person have the disease
or not?
(b) Is someone a good credit risk or not?
(c) Should a student be admitted to college or not?
Discriminant analysis
• We wish to predict group membership for
a number of subjects from a set of
predictor variables.
• The criterion variable (also called grouping
variable) is the object of classification. This
is ALWAYS a categorical variable.
• Simple case: two groups and p predictor
variables.
8. Discriminant analysis
• Similar to regression:
– Which predictor variables are related to the
criterion (dependent variable)?
– Predict values on the criterion variable when
given new values on the predictor variables
Discriminant analysis
• Can we classify new (unclassified) subjects into
groups?
– Given the classification functions, how accurate are
we? And when we are inaccurate, is there some
pattern to the misclassification?
D = (.024 × age) + (.080 × self-concept) + (-.100 × anxiety) + (-.012 × days absent)
+ (.134 × anti-smoking score) - 4.543
• What is the strength of association between
group membership and the predictors?
9. Discriminant analysis
Questions?
• Which predictors are most important in
predicting group membership?
Practice 2
A study is set up to determine if the following variables
help to discriminate between those who smoke and those
who don’t:
• age
• absence (days of absence last year)
• selfcon (self-concept score)
• anxiety (anxiety score)
• anti_smoking (attitude towards anti-smoking policies)
10. Practice 2
• In SPSS, Analyze > Classify > Discriminant
11. Practice 2
Canonical Discriminant Function Coefficients
                                 Function 1
age                              .024
self concept score               .080
anxiety score                    -.100
days absent last year            -.012
total anti-smoking test score    .134
(Constant)                       -4.543
Unstandardized coefficients

Functions at Group Centroids
(means of group calculated by the D function)
                                 Function 1
non-smoker                       1.125
smoker                           -1.598
D = (.024 × age) + (.080 × self-concept) + (-.100 × anxiety) + (-.012 × days absent) +
(.134 × anti-smoking score) - 4.543
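The D function can be applied to a new case as follows. This is only a sketch: the subject's values are invented for illustration, and assigning the case to the group with the nearest centroid is a simplifying assumption (SPSS classifies using posterior probabilities, which also take prior group sizes into account). The names `discriminant_score` and `nearest_centroid` are hypothetical.

```python
# Unstandardized coefficients and constant from the SPSS output.
COEFFICIENTS = {
    "age": 0.024,
    "self_concept": 0.080,
    "anxiety": -0.100,
    "days_absent": -0.012,
    "anti_smoking": 0.134,
}
CONSTANT = -4.543

# Group centroids: the mean D score of each group.
CENTROIDS = {"non-smoker": 1.125, "smoker": -1.598}


def discriminant_score(subject):
    """Compute D for one subject (a dict of predictor values)."""
    return CONSTANT + sum(
        COEFFICIENTS[name] * subject[name] for name in COEFFICIENTS
    )


def nearest_centroid(score):
    """Assign the group whose centroid is closest to the score."""
    return min(CENTROIDS, key=lambda group: abs(score - CENTROIDS[group]))


# Hypothetical subject; the values are made up for illustration.
subject = {
    "age": 40,
    "self_concept": 45,
    "anxiety": 20,
    "days_absent": 5,
    "anti_smoking": 25,
}
d = discriminant_score(subject)
group = nearest_centroid(d)
print(f"D = {d:.3f}, predicted group: {group}")
```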
Practice 2
Classification Results (a, c)
                                          Predicted Group Membership
                            smoke or not  non-smoker   smoker    Total
Original             Count  non-smoker       238          19       257
                            smoker            17         164       181
                     %      non-smoker      92.6         7.4     100.0
                            smoker           9.4        90.6     100.0
Cross-validated (b)  Count  non-smoker       238          19       257
                            smoker            17         164       181
                     %      non-smoker      92.6         7.4     100.0
                            smoker           9.4        90.6     100.0
a. 91.8% of original grouped cases correctly classified.
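The percentages in the classification table follow directly from the counts (238 and 19 in the non-smoker row, 17 and 164 in the smoker row). A short sketch, with hypothetical variable names, reproduces the overall accuracy and the per-group hit rates:

```python
# Rows: actual group; inner keys: predicted group (counts from the table).
counts = {
    "non-smoker": {"non-smoker": 238, "smoker": 19},
    "smoker": {"non-smoker": 17, "smoker": 164},
}

total = sum(sum(row.values()) for row in counts.values())
correct = sum(counts[group][group] for group in counts)

# Overall percentage of cases placed in their actual group.
accuracy = 100 * correct / total

# Percentage correctly classified within each actual group.
hit_rates = {
    group: 100 * counts[group][group] / sum(counts[group].values())
    for group in counts
}

print(f"overall: {accuracy:.1f}% of {total} cases correctly classified")
for group, rate in hit_rates.items():
    print(f"{group}: {rate:.1f}% correctly classified")
```

The off-diagonal counts (19 and 17) show the pattern of misclassification: slightly more non-smokers are classified as smokers than the reverse.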
12. Practice 2
When reporting the result, we should include the following:
• Names of the predictors and sample size
• Results of the univariate ANOVAs and the Box’s M test
• The significance of the discriminant function
• The variance explained (squared canonical correlation coefficient)
• Significant predictors and their contribution to the
model (discriminant function)
• Result from the cross-validation process
(page 9)
Logistic regression
• In logistic regression the response (Y) is a
dichotomous categorical variable.
For example: voting, mortality, and
participation data are not continuous or
normally distributed.
Binary logistic regression is a type of
regression analysis where the dependent
variable is a dummy variable: coded 0 (did not
vote) or 1 (did vote)
13. Logistic regression
• Models the relationship between a set of
variables xi
– dichotomous (eat: yes/no)
– categorical (social class, ...)
– continuous (age, ...)
and a dichotomous variable Y
Binary Logistic regression
• Binary logistic regression is a type of
regression analysis where the dependent
variable is a dummy variable (coded 0, 1)
14. Binary Logistic regression
Binary dependent variables: a few examples
• A consumer chooses a brand (1) or not (0)
• A quality defect occurs (1) or not (0)
• A person is hired (1) or not (0)
Binary Logistic regression
• The logistic regression model is simply a
non-linear transformation of the linear regression.
• The logistic function is an S-shaped
cumulative distribution function, similar in
shape to that of the standard normal
distribution. It constrains the
estimated probabilities to lie between 0 and 1.
15. Binary Logistic regression
• p: the probability of success/event (range from 0 to 1)
• 1-p: probability of failure/non-event
If the probability of success is .8 (80%), the
probability of failure is ???
• The odds of success: the ratio between the probability
of success over the probability of failure
• What are the odds of success for the above situation?
• What can we conclude about the probabilities of success
and failure in a situation when the odds equal 1?
Binary Logistic regression
• The odds of success: the ratio between the probability
of success over the probability of failure
• Logistic regression: model the logit-transformed probability as
a linear relationship with the predictor variables.
• logit(p) = log(p/(1-p)) = log(odds) = b0 + b1*x1 + ... + bk*xk
• Converting the logit back to a probability:
p = exp(b0 + b1*x1 + ... + bk*xk)/(1 + exp(b0 + b1*x1 + ... + bk*xk))
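The relationship between probability, odds and logit can be checked numerically. The sketch below uses the probability of success of .8 from the earlier slide; the function names `odds`, `logit` and `inverse_logit` are hypothetical helpers, not SPSS terms.

```python
import math


def odds(p):
    """Odds of success: P(success) / P(failure)."""
    return p / (1 - p)


def logit(p):
    """Log odds of a probability p."""
    return math.log(odds(p))


def inverse_logit(z):
    """Convert a logit (linear predictor) back to a probability."""
    return math.exp(z) / (1 + math.exp(z))


p = 0.8
print(f"P(failure) = {1 - p:.1f}")
print(f"odds       = {odds(p):.1f}")   # 4 to 1 in favour of success
print(f"logit      = {logit(p):.3f}")
print(f"back to p  = {inverse_logit(logit(p)):.1f}")

# Odds of 1 mean success and failure are equally likely (p = .5),
# which corresponds to a logit of 0.
print(f"odds at p = .5: {odds(0.5):.1f}")
print(f"p at logit 0:   {inverse_logit(0):.1f}")
```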
16. Binary Logistic regression
(SPSS output)
Variables in the Equation
                     B (log odds)   S.E.   Wald   df   Sig.   Exp(B) (odds ratio)
Step 1a  gender(1)   -.005          .202   .001   1    .981   .995
• If the odds ratio > 1: when the predictor increases, the odds
of the event occurring increase.
• If the odds ratio < 1: when the predictor increases, the odds
of the event occurring decrease.
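The columns of the SPSS output are linked: Exp(B) is the exponential of B, and the Wald statistic is (B / S.E.)². A quick check with the values for gender (a sketch; small differences can arise because SPSS works with unrounded coefficients):

```python
import math

# B and S.E. for gender(1), as printed in the SPSS output.
b, se = -0.005, 0.202

wald = (b / se) ** 2       # Wald chi-square with 1 df
odds_ratio = math.exp(b)   # Exp(B)

print(f"Wald   = {wald:.3f}")
print(f"Exp(B) = {odds_ratio:.3f}")
# An odds ratio this close to 1 means the odds of smoking hardly
# differ between the two gender groups, in line with Sig. = .981.
```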
Practice 3
• Conduct logistic regression to see if gender is a
significant predictor of whether someone is a
smoker or non-smoker.
• In SPSS, Analyze > Regression > Binary
Logistic
• The data file is smoker_DA.sav.
17. Practice 3
Analyze > Regression > Binary Logistic
Practice 3
• Conduct logistic regression to see if anti-
smoking attitude is a significant predictor
of whether someone is a smoker or non-
smoker.
• In SPSS, Analyze > Regression > Binary
Logistic
• The data file is smoker_DA.sav.
18. Practice 3
Conduct logistic regression to see whether the following are
significant predictors of whether someone is a smoker or
non-smoker:
• age
• gender
• absence (days of absence last year)
• selfcon (self-concept score)
• anxiety (anxiety score)
• anti_smoking (attitude towards anti-smoking policies)
(When we have no idea about the relative importance of the
predictors, we choose a stepwise method: Forward LR.)
Practice 3
                                                       95% C.I. for Odds Ratio
                          B         S.E.   Odds Ratio   Lower    Upper
constant                  9.257**   2.050  10480.856
self-concept              -.260**   .033   .771         .724     .822
anxiety                   .236**    .036   1.266        1.181    1.357
absence                   .075*     .030   1.078        1.016    1.144
anti-smoking test score   -.303**   .075   .739         .638     .856
Notes. R² = .607 (Cox & Snell), .818 (Nagelkerke). Model χ² (8) = 42.0, p < .001.
* p < .05. ** p < .01.
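The odds ratios and their confidence intervals in this table can be approximately reproduced from B and S.E. The sketch below assumes the usual Wald interval, exp(B ± 1.96 × S.E.); because the printed coefficients are rounded to three decimals, the results may differ from the SPSS values in the last digit. The function name `odds_ratio_ci` is hypothetical.

```python
import math

# (B, S.E.) for each predictor, as printed in the table.
predictors = {
    "self-concept": (-0.260, 0.033),
    "anxiety": (0.236, 0.036),
    "absence": (0.075, 0.030),
    "anti-smoking test score": (-0.303, 0.075),
}


def odds_ratio_ci(b, se, z=1.96):
    """Return (odds ratio, lower, upper) for a 95% Wald interval."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


results = {name: odds_ratio_ci(b, se) for name, (b, se) in predictors.items()}
for name, (ratio, lower, upper) in results.items():
    print(f"{name:<24} OR = {ratio:.3f}  95% CI [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes 1 corresponds to a predictor that is significant at the .05 level, which matches the asterisks in the table.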
19. Practice 3
• Report:
• A logistic regression analysis was conducted with age, gender,
number of days absent from work in the previous year,
self-concept score, anxiety score, and attitude to anti-smoking
workplace policy as predictors. A total of 438 cases were
analyzed. The full model significantly predicted whether an
employee is a smoker or non-smoker (χ² = 42.04, df = 8, p < .001),
accounting for between 60.7% and 81.8% of the variance in
group membership, with 92.6% of non-smokers and 90.6% of
smokers successfully predicted.