2. Part VI (Sub-part II)
Data Analysis, Interpretation
and Reporting
5/11/2022 2
3. Chapter Six: Data Analysis, Interpretation and Reporting
Data Management and Support Software
Descriptive Analysis
Inferential Analysis
Hypothesis Testing
Interpretation, scientific writing and reporting
4. Data Analysis: Introduction
Once the data are ready for processing, the next step is to
choose an appropriate analysis method and conduct the analysis.
Data analysis depends on the nature of the variable, the type
of data and the purpose of the analysis. The following issues
will affect the data analysis part of your research endeavor.
The type of data you have gathered, (i.e.
Nominal/Ordinal/Interval/Ratio)
Are the data paired such as before and after treatment?
Are they parametric or non-parametric?
Ranks, scores, or categories are generally non-parametric data.
Measurements that come from a population that is normally
distributed can usually be treated as parametric.
What are you looking for: differences, correlations, etc.?
5. Data Analysis: Introduction
Simply put: Data analysis is the process of making
meaning from the data
Broadly classified, data analysis involves:
Quantitative analysis
Qualitative analysis
The quantitative analysis uses numeric
expressions/representations and manipulations of the
collected data.
The analysis could take descriptive or inferential form.
Based on number of variables involved, quantitative
analysis could be univariate, bivariate and/ or multivariate
analysis.
6. Quantitative analysis
Descriptive vs Inferential analysis:
Descriptive analysis: refers to statistically describing,
aggregating, and presenting the constructs of interest or
associations between these constructs.
Inferential analysis: refers to the statistical estimation of
parameter values and testing of hypotheses (theory testing).
With respect to the number of variables:
Univariate analysis: only one variable is analyzed
Bivariate analysis: two variables are analyzed
Multivariate analysis: more than two variables are
included in the analysis process
The appropriate analysis also varies with the four scales of measurement
(nominal, ordinal, interval, ratio).
8. Reliability Analysis/Test (SPSS)
It helps measure consistency of an instrument.
Internal consistency is the most commonly used measure
of reliability
Factors that increase reliability
Number of items
High variation among individuals being tested
Clear instructions
Optimal testing situation
Analyze → Scale → Reliability Analysis → select items →
Statistics → choose statistical tests → Continue → choose
Alpha from the model list → OK
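As a rough illustration of internal consistency, the sketch below computes Cronbach's alpha by hand, assuming that the "Alpha" option in the menu path refers to that coefficient: alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The function name `cronbach_alpha` and the response data are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item, same respondents in each."""
    k = len(items)
    # Total score of each respondent across all items
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(pvariance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / pvariance(totals))

# Hypothetical data: 3 items answered by 5 respondents
items = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))
```

Values closer to 1 indicate that the items vary together, which is consistent with the factors listed above (more items and more variation among respondents tend to raise the coefficient).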
9. Univariate analysis (Descriptive analysis)
• The following categories of the descriptive analysis are usually
used.
• Frequency distributions
• Measures of central tendency
• Measures of dispersion
• Shape of distribution
1) Frequency distributions (tables, bar graph, pie chart, histogram)
a) Frequency table: a table summarizing the values of a variable
and the number of times the variable assumes a given value. It
has:
• Descriptive title
• Clear labels for columns and rows
• Appropriate categories
• Presentation of frequencies and corresponding percentages
10. Univariate analysis…
b) Pie charts and bar charts: when the data are nominal or
ordinal, we use a pie chart or a bar chart. A pie chart displays only
one variable, whereas a bar chart can display more than
one.
c) Histogram: histograms are used when the data are measured
at the interval (or ratio) level.
We can also use line graphs to explore the variable(s).
11. Univariate analysis…
• Example: Frequency table (Leisure time preference)
Preference          Frequency   Percentage   Cumulative %
With friends             9          9.0           9.0
Sport activities        30         30.0          39.0
With family             40         40.0          79.0
Reading                 21         21.0         100.0
Total                  100        100.0
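The percentage and cumulative columns of such a table follow directly from the counts. A minimal sketch in Python, using the frequencies from the leisure-time example (the variable names are illustrative):

```python
from itertools import accumulate

# Frequencies from the leisure-time preference example
counts = {
    "With friends": 9,
    "Sport activities": 30,
    "With family": 40,
    "Reading": 21,
}

total = sum(counts.values())
percentages = [100 * c / total for c in counts.values()]
cumulative = list(accumulate(percentages))  # running total of the percentages

for (label, count), pct, cum in zip(counts.items(), percentages, cumulative):
    print(f"{label:<18}{count:>5}{pct:>8.1f}{cum:>8.1f}")
print(f"{'Total':<18}{total:>5}{sum(percentages):>8.1f}")
```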
12. Example:
Bar Diagram: Lists the categories and presents the percent or
count of individuals who fall in each category.
13. Example:
Pie Chart: Lists the categories and presents the percent or
count of individuals who fall in each category.
14. Example:
Histogram: The overall pattern can be described by its shape,
center, and spread. The following age distribution is right
skewed; its center lies between 80 and 100, and there are no outliers.
15. Frequency distributions in SPSS
Frequency tables are found under the ‘Analyze’ menu:
(Analyze → Descriptive Statistics → Frequencies)
Then select the variables and move them to the ‘Variable(s)’
box, choose from the options, tick ‘Display frequency tables’, and click OK
Charts and graphs: two options
Analyze → Descriptive Statistics → Frequencies → Charts
Graphs → Legacy Dialogs → charts/graphs (options)
18. Univariate analysis…
2) Measures of central tendency
Central tendency is an estimate of the center of a
distribution of values.
There are three major estimates of central
tendency: mean, median, and mode.
19. Measures of central tendency…
1. Mean
For a data set, the mean is the sum of the values divided
by the number of values. The mean of a set of numbers
x1, x2, ..., xn is typically denoted by $\bar{x}$, pronounced "x bar":
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
This is the arithmetic mean, the "standard" average, often simply
called the "mean" or the "average". It describes the central
location of the data.
It is used mainly for interval (and ratio) variables.
It is very widely used and intuitively appealing.
20. Measures of central tendency…
2. Median
It is the middle value of the distribution when all items are
arranged in either ascending or descending order in terms
of value
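It is the mid-point value: arrange the data from lowest to highest to
identify the middle value; if there are two middle values, take their average.
The mean is sensitive to outliers, but the median is robust.
For n ordered values, the median is found at position
$$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ value}$$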
21. Measures of central tendency…
3. Mode
It is the value that occurs most frequently in the data set
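A short sketch of the three measures using Python's standard `statistics` module. The ages are hypothetical and include one deliberately extreme value to show that the mean is pulled towards an outlier while the median is not.

```python
from statistics import mean, median, mode

# Hypothetical ages; 95 is an outlier
ages = [23, 25, 25, 27, 30, 31, 95]

mean_age = mean(ages)      # sum of values / number of values
median_age = median(ages)  # middle value of the sorted data
mode_age = mode(ages)      # most frequent value

# Dropping the outlier changes the mean much more than the median
trimmed = ages[:-1]
print(mean_age, median_age, mode_age)
print(mean(trimmed), median(trimmed))
```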
3) Measures of dispersion
• It measures the amount of scatter or variation in a dataset.
• Or it refers to the way values are spread around the central
tendency, for example, how tightly or how widely the
values are clustered around the mean.
• Similar measures of central tendency may come from very
different distributions.
23. Measures of dispersion…
Common measures of dispersion include minimum,
maximum, range, variance and standard deviation.
But, the most frequently used in analysis are range
and standard deviation
Range = Maximum value – Minimum value
Range is sensitive to outliers
24. Measures of dispersion…
Variance:
The variance is used as a measure of how far a set of
numbers are spread out from each other. It is one of
several descriptors of a probability distribution,
describing how far the numbers lie from the mean
(expected value). In particular, the variance is one of
the moments of a distribution.
$$\mathrm{Var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$
25. Measures of dispersion…
Standard deviation:
It is a widely used measurement of variability or diversity used
in statistics and probability theory. It shows how much variation
or “dispersion" there is from the average (mean, or expected
value). A low standard deviation indicates that the data points
tend to be very close to the mean, whereas high standard
deviation indicates that the data are spread out over a large range
of values. The standard deviation of X is given by:
$$SD(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$
A useful property of the standard deviation is that, unlike the variance, it is
expressed in the same units as the data.
26. Measures of dispersion
Coefficient of variation (CV):
In probability theory and statistics, the coefficient of
variation (CV) is a normalized measure of dispersion of a
probability distribution. It is also known as unitized risk or
the variation coefficient. The coefficient of variation (CV)
is defined as the ratio of the standard deviation to the
mean:
$$CV = \frac{SD}{\text{Mean}}$$
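The dispersion measures can be computed in a few lines. This is a minimal sketch assuming the divide-by-n forms, Var(x) = Σ(xi − x̄)²/n and SD = √Var; many packages divide by n − 1 instead. The data and function names are hypothetical.

```python
import math

def variance(xs):
    """Variance with divisor n (assumed form; n - 1 is also common)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    return math.sqrt(variance(xs))

def coefficient_of_variation(xs):
    return std_dev(xs) / (sum(xs) / len(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

data_range = max(data) - min(data)   # Range = Maximum - Minimum
var = variance(data)
sd = std_dev(data)                   # same units as the data
cv = coefficient_of_variation(data)  # unit-free: SD / Mean
print(data_range, var, sd, cv)
```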
27. Measures of shape of distribution
4) Measures of shape of distribution
Skewness and kurtosis are the commonly used
measures of the shape of the distribution of a dataset.
Skewness:
It refers to the symmetry or asymmetry of the
distribution.
The skewness value can be positive or negative, or
even undefined.
28. Measures of shape of distribution…
Skewness:
Qualitatively, a negative skew indicates that the tail on the
left side of the probability density function is longer than
the right side and the bulk of the values (possibly including
the median) lie to the right of the mean.
A positive skew indicates that the tail on the right side is
longer than the left side and the bulk of the values lie to the
left of the mean. A zero value indicates that the values are
relatively evenly distributed on both sides of the mean,
typically but not necessarily implying a symmetric
distribution.
29. Measures of shape of distribution…
The skewness of a random variable X is the third
standardized moment. For a sample it is computed as
$$SK = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)\,S^3}$$
where S is the standard deviation.
The coefficient of skewness is a measure of the degree
of symmetry in the variable's distribution.
30. Measures of shape of distribution…
Kurtosis:
It refers to peakedness of the distribution.
It is a measure of the "peakedness" of the probability
distribution of a real-valued random variable.
Higher kurtosis means more of the variance is the result of
infrequent extreme deviation, as opposed to frequent
modestly sized deviations.
$$KU = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n-1)\,S^4}$$
31. Measures of shape of distribution…
The coefficient of Kurtosis is a measure for the degree of
peakedness/flatness in the variable distribution.
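A small sketch of both shape coefficients, assuming the forms SK = Σ(xi − x̄)³ / ((n − 1)S³) and KU = Σ(xi − x̄)⁴ / ((n − 1)S⁴), with S taken here as the sample standard deviation (an assumption; statistical packages apply slightly different corrections). The data sets are hypothetical.

```python
from statistics import mean, stdev

def skewness(xs):
    n, m, s = len(xs), mean(xs), stdev(xs)
    return sum((x - m) ** 3 for x in xs) / ((n - 1) * s ** 3)

def kurtosis(xs):
    n, m, s = len(xs), mean(xs), stdev(xs)
    return sum((x - m) ** 4 for x in xs) / ((n - 1) * s ** 4)

symmetric = [1, 2, 3, 4, 5]        # evenly spread around the mean
right_skewed = [1, 1, 2, 2, 3, 10]  # long tail on the right

sk_symmetric = skewness(symmetric)   # zero for symmetric data
sk_right = skewness(right_skewed)    # positive: tail on the right
ku_symmetric = kurtosis(symmetric)
print(sk_symmetric, sk_right, ku_symmetric)
```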
32. Central tendency, dispersion and shape in SPSS
Analyze → Descriptive Statistics → Descriptives →
Options (select the statistics of interest)
33. The Normal Distribution Assumption
The Normal distribution is a distribution in which cases
cluster symmetrically around the mean. It is
the most useful distribution in statistics, and has the
following important properties:
1. It is symmetric and bell-shaped
2. Mode, median, and mean coincide
3. As a corollary to (1), a fixed proportion of observations
lies between the mean and fixed units of standard
deviation (for example, about 68% lie within one standard
deviation of the mean and about 95% within two).
34. Normal distribution…
Z-Score (Standard Normal Curve): the standard normal curve is a normal
curve with mean = 0 and standard deviation
S = 1. Z-scores are used to compare scores in two or more
distributions that have different means and standard
deviations.
$z = (x - \bar{x})/s$, where z = number of standard
deviations, ….
If the data are normally distributed, we employ
parametric tests.
If the data are categorical or if the assumption of
normality does not hold, we use non-parametric tests.
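To illustrate how z-scores make scores from different distributions comparable, the sketch below applies z = (x − mean)/s to two hypothetical test results; the scores, means and standard deviations are made up for the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical: the same student sits two tests with different scales
z_math = z_score(70, mean=60, sd=5)      # 70 on a test with mean 60, SD 5
z_english = z_score(80, mean=75, sd=10)  # 80 on a test with mean 75, SD 10

# The raw English score is higher, but the maths result is better in relative terms
print(z_math, z_english)
```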
38. Bivariate analysis
How do we analyze relationships between two variables?
Bivariate analysis is the analysis of two variables to examine whether
they are correlated or whether there are differences between their values.
Remember: co-variation does not always imply causation.
39. Bivariate analysis
• Examples:
• Do men earn more income than women?
• Does educational level affect attitudes toward
participation in labour union?
• Is income level correlated with life expectancy?
• Is parental educational level correlated with student
performance?
We need to conduct hypothesis testing to arrive at
conclusive results on questions like these.
40. Hypotheses Testing
The following are the steps in hypothesis testing:
1. State the null hypothesis.
2. Choose an appropriate statistical test.
3. Specify the level of statistical significance (usually
0.1, 0.05 or 0.01), known as the α-level.
4. Decide whether to reject or fail to reject the null hypothesis
based on the findings.
We use different tests based on the nature of the dependent
and independent variables and the nature of the distribution of the
data.
During hypothesis testing, there is a possibility of
committing decision errors. There are two types of errors.
41. Hypothesis…
"Type I error"
A Type I error is a false positive result: we reject a null
hypothesis that is true.
For example, if you use a parametric test on non-parametric data,
this could trick the test into seeing a significant effect
when there isn't one.
The probability of committing a Type I error is called the
significance level (α); the p-value is compared against it.
This error usually receives more attention and is important to avoid.
42. Hypothesis…
“Type II error”
It occurs when we fail to reject a null hypothesis that is false.
For example, if you use a non-parametric test on
parametric data, this could reduce the chance of
seeing a significant effect when there is one.
A Type II error is a missed opportunity, i.e. we have
failed to detect a significant effect that truly does exist.
It is often treated as the less serious of the two errors.
Summary: Using a parametric test in the wrong context
may lead to a Type I error, a false positive.
Using a non-parametric test in the wrong context may lead
to a Type II error, a missed opportunity.
43. Hypothesis…
Reading the p-value
It is the basis for deciding whether or not to reject the
null hypothesis.
P-values do not simply provide you with a Yes or No
answer; they provide a sense of the strength of the
evidence against the null hypothesis.
The lower the p-value, the stronger the evidence. Usually, when it is
less than 0.05 or 0.01, the null hypothesis is rejected.
It is the probability that a statistical result at least as extreme as
the one observed would occur if the null hypothesis
were true.
45. Hypothesis…
Parametric tests
T-test (one sample, independent samples, paired samples)
One-way ANOVA
Repeated-measures ANOVA (for paired data)
Pearson correlation
Non-parametric tests
Chi-square test for independence
Mann-Whitney Test
Wilcoxon Signed Rank Test
Kruskal-Wallis Test
Friedman Test
Spearman Rank Order Correlation
46. Hypothesis…
Variable | Nominal | Ordinal | Interval/Ratio | Dichotomous
Nominal | Contingency table, Chi-square, Cramer’s V | Contingency table, Chi-square, Cramer’s V | Z-test, T-test or F-test (if DV is interval/ratio) | Contingency table, Chi-square, Cramer’s V
Ordinal | Contingency table, Chi-square, Cramer’s V | Spearman’s rho (ρ) | Spearman’s rho (ρ) | Spearman’s rho (ρ)
Interval/Ratio | Z-test, T-test or F-test (if DV) | Spearman’s rho (ρ) | Pearson’s r | Spearman’s rho (ρ)
Dichotomous | Contingency table, Chi-square, Cramer’s V | Spearman’s rho (ρ) | Spearman’s rho (ρ) | Phi (ɸ)
47. Hypothesis…
Requirement | Example of Situation | Test to be Used
Compare to a target | Is the average age of employees more than 40 years? | One-sample t-test
Compare two groups | Do men earn more income than women? | Independent-samples t-test
Compare two groups with one controlled intervention | Test scores before and after training | Paired t-test
Compare more than two groups | Compare amount of income between four categories of educational level | One-way ANOVA (F-test)
Association between two categorical variables | Is there an association between gender and job grade? | Contingency table, Chi-square
Association between two quantitative variables | Is there an association between advertising and sales? | Pearson’s r
48. Hypothesis…
Contingency Table analysis (Cross-tabulation):
We look for differences among categories (hence
nominal or ordinal level measurement) of the
independent variable (IV). That is, does the IV influence the
dependent variable (DV)?
Contingency Table (Cross-tabulation): a table of
percentage distributions with the DV in rows and the IV in
columns.
It is a bivariate frequency distribution showing the number of
cases that fall into each possible pairing of the values or
categories of the variables.
49. Chi-square Test
Chi-square Test (Chi is pronounced "ky" as in
‘sky’):
It is employed to test relationships between two variables
when the data are measured at the nominal or ordinal
level.
The Chi-square test for independence can be used in
situations where you have two categorical variables.
It works with the "simplest" form of data:
data such as gender or country, or data that have been
placed in categories, such as age group.
50. Chi-square Test
Chi-square can be calculated as follows:
$$\chi^2_c = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
If the calculated chi-square is greater than the critical chi-square
value obtained from the table, then we conclude
there is a relationship (that is, we reject H0).
Remember, as in all hypothesis testing, the null hypothesis of the Chi-
square test is that there is no relationship between
the DV and IV.
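A hand computation of the statistic Σ(observed − expected)²/expected for a contingency table may make the procedure concrete. This is a sketch on a hypothetical 2×2 table; expected counts are taken as row total × column total / grand total, and 3.841 is the usual tabulated critical value for 1 degree of freedom at α = 0.05.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical cross-tabulation: rows = DV categories, columns = IV categories
observed = [
    [30, 20],
    [20, 30],
]
chi2, df = chi_square(observed)

CRITICAL_05_DF1 = 3.841  # table value for df = 1, alpha = 0.05
reject_h0 = chi2 > CRITICAL_05_DF1
print(chi2, df, reject_h0)
```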
51. Contingency Table and Chi-square in SPSS
Analyze → Custom Tables → Custom Tables →
OK → Row and Column → Test Statistics → Tests
of independence (Chi-square) → OK
Or
Analyze → Descriptive Statistics → Crosstabs →
move the DV into Rows and the IV into Columns →
Statistics → Chi-square → OK
52. Comparing two groups: T-tests
A t-test is a statistical hypothesis test in which the test statistic
follows a Student’s t-distribution if the null hypothesis is true. The
t-statistic was introduced by W.S. Gosset under the pen name “Student”.
It is among the most frequently used procedures for determining
whether or not the means of two independent groups could
conceivably have come from the same population.
If you compute means for two samples, they will almost always
differ to some degree. The job of the t-test is to see whether they
differ by chance or whether the difference is real and reliable.
For a single sample compared with a hypothesized mean μ, it is given by:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
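As an illustration, the one-sample statistic t = (x̄ − μ)/(s/√n) can be computed directly. The sketch assumes a hypothetical sample of employee ages tested against a target mean of 40 years; s is the sample standard deviation and the degrees of freedom are n − 1.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu):
    """Return (t statistic, degrees of freedom) for H0: population mean = mu."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    return (mean(sample) - mu) / standard_error, n - 1

# Hypothetical employee ages
ages = [38, 42, 45, 41, 44, 46, 39, 43]
t_stat, df = one_sample_t(ages, mu=40)
print(round(t_stat, 3), df)
```

The resulting t is then compared with the tabulated t value for the chosen α-level and df, or converted to a p-value by the software.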
53. T-test in SPSS
• Parametric
• Analyze → Compare Means → One-Sample T Test or
Independent-Samples T Test or Paired-Samples T Test
• Non-parametric
• Analyze → Nonparametric Tests → Related Samples or
Independent Samples or One Sample → Automatically
compare observed data to hypothesized
54. Comparing more than two groups: ANOVA
ANOVA (similar to the Difference of Means Test) is used
to examine variations among groups (and among
members within a group) with respect to some behavior
and to see if the variations are statistically significant.
Groups may be, for example: male/female; economically
developed/economically developing; smokers/non-smokers;
dry-lands/wet-lands; religious/non-religious;
high/medium/low; etc.
In ANOVA, the DV has an interval/scale measure,
while the IV has a nominal or ordinal measure.
55. ANOVA test
We use the F-test in ANOVA, given by
$$F_{calc} = \frac{\text{between-groups mean square}}{\text{within-groups mean square}}$$
Now, if Fcalc > Ftable, then reject H0.
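A minimal sketch of a one-way ANOVA F statistic, assuming the standard definition F = (between-groups sum of squares / (k − 1)) / (within-groups sum of squares / (n − k)) for k groups and n observations. The three groups are hypothetical.

```python
def one_way_anova_f(groups):
    """F = between-groups mean square / within-groups mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical incomes (in thousands) for three educational levels
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f_calc = one_way_anova_f(groups)
print(f_calc)
```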
56. ANOVA in SPSS
Analyze → Compare Means → One-Way ANOVA...
(parametric test)
Analyze → Nonparametric Tests, such as the Kruskal-Wallis
one-way non-parametric ANOVA
Choose Post Hoc... → Post Hoc Tests → choose Tukey
57. Scatterplots/diagrams
Scatter plot/diagram:
values of the two variables plotted on each axis
strong relationships can be identified by scatter
diagrams
Four relationships can be identified
Positive linear
Negative linear
Non linear
No relationship at all
58. Scatter plot of a positive association
[Figure: scatter plot of income and livestock ownership; x-axis: Income, y-axis: Livestock]
59. Scatter plot of a negative association
[Figure: scatter plot of income and illiteracy rates (%); x-axis: Income, y-axis: Rate of illiteracy (%)]
60. Scatter plot of no association
[Figure: scatter plot of income and household size; x-axis: Income, y-axis: Household size]
61. Scatter and line graph
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
63. Covariance and Correlations
The interest is in the association/relationship
between two variables, or whether they vary together.
Examples:
Does the income of individuals increase as age increases?
Is the amount of sales associated with advertising
expenditure?
Is crime related to socio-economic background?
Is student academic achievement associated with
parents’ educational level?
64. Covariance
Covariance:
Covariance between X and Y refers to a measure of how much
two variables change together.
Covariance indicates how two variables are related. A positive
covariance means the variables are positively related, while a
negative covariance means the variables are inversely related.
The formula for calculating the covariance of sample data is
shown below.
$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$
65. Correlation Analysis
Correlation:
It is concerned with the existence, direction
and strength of the relationship between variables.
Correlation coefficients can be calculated to see the
direction and strength of the relationship.
The choice of coefficient depends on the nature of the variables
(parametric vs non-parametric, or numeric vs non-numeric).
$$r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$$
66. Correlation...
The most commonly used is Pearson’s correlation coefficient,
also called Pearson’s r or simply the correlation coefficient.
It captures the linear relationship between variables; non-linear
relationships are not captured.
It lies between -1 and 1.
r=0: no linear relationship
r=1: perfect positive relationship
r=-1: perfect negative relationship
Spearman’s rho/rank correlation coefficient (ρ):
mainly for ordinal variables (non-parametric)
Phi (Φ): correlation between two dichotomous variables
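Covariance and Pearson's r can be sketched by hand as follows, assuming the divide-by-n covariance Cov(x, y) = Σ(xi − x̄)(yi − ȳ)/n and r = Cov(x, y)/√(Var(x)·Var(y)). The advertising and sales figures are hypothetical.

```python
import math

def covariance(xs, ys):
    """Covariance with divisor n (assumed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def pearson_r(xs, ys):
    # Var(x) is Cov(x, x); the divisor n cancels in the ratio
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

advertising = [1, 2, 3, 4, 5]   # hypothetical spending
sales = [2, 4, 6, 8, 10]        # rises exactly in step with advertising
returns = [10, 8, 6, 4, 2]      # falls exactly in step with advertising

cov_pos = covariance(advertising, sales)
r_pos = pearson_r(advertising, sales)    # perfect positive relationship
r_neg = pearson_r(advertising, returns)  # perfect negative relationship
print(cov_pos, r_pos, r_neg)
```

Unlike the covariance, r does not depend on the units of measurement, which is why it is the usual way to report strength of association.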
67. Correlations and Covariance in SPSS
Correlation
Analyze → Correlate → Bivariate → Correlation
Coefficients (choose depending on
parametric/non-parametric)
Covariance
Analyze → Correlate → Bivariate → Options → Cross-product
deviations and covariances
68. Regression Analysis
Regression analysis is a set of statistical techniques using past
observations to find (or estimate) the equation that best
summarizes the relationships among key economic variables.
The method requires that analysts:
(1) collect data on the variables in question,
(2) specify the form of the equation relating the variables,
(3) estimate the equation coefficients, and
(4) evaluate the accuracy of the equation.
Regression analysis is used to:
Predict the value of a dependent variable based on the
value of at least one independent variable
Explain the impact of changes in an independent variable
on the dependent variable
69. Regression…
Regression Analysis is Used Primarily to Model
Causality and Provide Prediction
Predict the values of a dependent (response) variable
based on values of at least one independent
(explanatory) variable
Explain the effect of the independent variables on the
dependent variable
The relationship between X and Y can be shown on a
scatter diagram
70. Simple Linear Regression Model
Only one independent variable, x
Relationship between x and y is described by a
linear function
Changes in y are assumed to be caused by
changes in x
Regression analysis serves three major purposes:
1. Description
2. Control
3. Prediction
71. Population Linear Regression
The population regression model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where y is the dependent variable, β0 is the population y-intercept,
β1 is the population slope coefficient, x is the independent variable,
and ε is the random error term, or residual.
β0 + β1x is the linear component and ε is the random error component.
72. Regression…
Explanatory and response variables are numeric.
The relationship between the mean of the response
variable and the level of the explanatory variable is
assumed to be approximately linear (a straight line).
Model:
$$Y = b_0 + b_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$
• b1 > 0 Positive Association
• b1 < 0 Negative Association
• b1 = 0 No Association
73. Critical Assumptions
The error term is normally distributed (Normality).
The error term has zero expected value or mean.
The error term has constant variance in each time period
and for all values of X (i.e. Homoscedasticity).
The error term’s value in one time period is unrelated to its
value in any other period (No autocorrelation).
The underlying relationship between the x variable
and the y variable is linear (Linearity).
74. Ordinary Least Squares (OLS) Estimations
b0: mean response when x = 0 (y-intercept)
b1: change in mean response when x increases by 1 unit (slope)
b0, b1 are unknown parameters (like μ)
b0 + b1x: mean response when the explanatory variable takes on the value x
75. Estimated Regression Model
The sample regression line provides an
estimate of the population regression line:
$$\hat{y}_i = b_0 + b_1 x$$
where ŷi is the estimated (or predicted) y value, b0 is the estimate of the
regression intercept, b1 is the estimate of the regression slope, and x is
the independent variable.
The individual random error terms ei are random variables with
a mean of zero.
76. Interpretation of the Slope and the Intercept
b0 is the estimated average value of y
when the value of x is zero
b1 is the estimated change in the
average value of y as a result of a
one-unit change in x
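The least-squares estimates for a simple linear regression can be sketched by hand, assuming the standard closed-form solution b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄. The data and function names are hypothetical.

```python
def fit_simple_ols(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx           # estimated change in y per one-unit change in x
    b0 = mean_y - b1 * mean_x  # estimated average y when x is zero
    return b0, b1

# Hypothetical observations
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_simple_ols(x, y)
predicted_at_6 = b0 + b1 * 6  # predicted y for a new x value
print(round(b0, 2), round(b1, 2), round(predicted_at_6, 2))
```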
77. Multiple Linear Regression
In simple linear regression we studied the relationship
between one explanatory variable and one response
variable.
Now, we look at situations where several explanatory
variables work together to explain the response variable.
78. Formal Statement of the Model
General regression model:
$$Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + \varepsilon$$
• b0, b1, …, bk are parameters
• X1, X2, …, Xk are known constants
• ε, the error terms, are independent N(0, σ²)
79. Estimating the parameters of the model
The values of the regression parameters bi are not known.
We estimate them from data.
As in the simple linear regression case, we use the least-
squares method to fit a linear function to the data.
The least-squares method chooses the b’s that make the
sum of squares of the residuals as small as possible.
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$
80. Testing for Overall Significance
Shows if Y Depends Linearly on All of the X Variables
Together as a Group
Use F Test Statistic
Hypotheses:
H0: b1 = b2 = … = bk = 0 (No linear relationship)
H1: At least one bi ≠ 0 (At least one independent variable affects
Y)
The Null Hypothesis is a Very Strong Statement
The Null Hypothesis is Almost Always Rejected
81. Model Fitness Tests
Analysis of Variance and F Statistic
$$F = \frac{\text{Explained Variation}/(k-1)}{\text{Unexplained Variation}/(n-k)} = \frac{R^2/(k-1)}{(1-R^2)/(n-k)} = \frac{MSR}{MSE}$$
82. Test for Overall Significance
ANOVA
              df    SS          MS          F          Significance F
Regression     2    228014.6    114007.3    168.4712   1.65411E-09
Residual      12    8120.603    676.7169
Total         14    236135.2
Notes: regression df = k - 1 = 2, where k = 3 is the number of parameters;
total df = n - 1; Significance F is the p-value.
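The F value in the ANOVA table can be reproduced from its sums of squares. The sketch below uses the figures from the table (n = 15 follows from the total df of 14) and checks that MSR/MSE agrees with the R²-based form of the F statistic, where R² is taken as the explained sum of squares over the total.

```python
# Figures taken from the ANOVA table
ss_regression = 228014.6
ss_residual = 8120.603
n, k = 15, 3  # n - 1 = 14 total df; k = 3 parameters

ms_regression = ss_regression / (k - 1)  # MSR
ms_residual = ss_residual / (n - k)      # MSE
f_stat = ms_regression / ms_residual

# Same statistic from R^2 = explained SS / total SS
r_squared = ss_regression / (ss_regression + ss_residual)
f_from_r2 = (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))

print(round(ms_regression, 1), round(ms_residual, 4))
print(round(f_stat, 4), round(r_squared, 4))
```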
83. The Coefficient of Determination – R2
The coefficient of determination is the proportion of
the total variance that is explained by the regression.
It is the ratio of the explained sum of squares to the total
sum of squares.
84. The Coefficient of Determination – R2
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2}$$
The higher R² is, the closer the estimated regression equation fits the
sample data.
• Since TSS, RSS and ESS are all non-negative (being sums of squared deviations),
• and since ESS ≤ TSS, R² must lie in the interval 0 ≤ R² ≤ 1.
• A value of R² close to one shows a “good” overall fit, whereas a value
near zero shows a failure of the estimated regression equation to explain
the variation in Y.
85. Multiple regression model building
Often we have many explanatory variables, and our goal is
to use these to explain the variation in the response
variable.
A model using just a few of the variables often predicts
about as well as the model using all the explanatory
variables.
86. Linear Regression in SPSS
Analyze → Regression → Linear → select several
options
88. Logistic regression
There are many important research topics for which the
dependent variable is "limited."
For example: voting, morbidity or mortality, and
participation data are not continuous or normally
distributed.
Binary logistic regression is a type of regression analysis
where the dependent variable is a dummy variable: coded 0
(did not vote) or 1 (did vote).
Binary models
Discrete choice models, etc.
89. The Linear Probability Model
The linear probability model can be written as:
Y = α + βX + e, where Y = (0, 1), or
P(y = 1|x) = b0 + xb
But:
The error terms are heteroskedastic
e is not normally distributed because Y takes on only two
values
The predicted probabilities can be greater than 1 or less
than 0
An alternative is to model the probability as a function, G(b0
+ xb), where 0 < G(z) < 1
90. The Logit Model
A common choice for G(z) is the logistic function, which is the
cdf for a standard logistic random variable:
G(z) = exp(z)/[1 + exp(z)] = L(z)
This case is referred to as a logit model, or a logistic regression
The estimated probability is given as:
ln[p/(1-p)] = α + bX + e or
p = 1/[1 + exp(-α - bX)]
91. The Logit Model
Where:
p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the odds
ln[p/(1-p)] is the log odds, or "logit"
The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
if you let α + bX = 0, then p = .50
as α + bX gets really big, p approaches 1
as α + bX gets really small (large and negative), p approaches 0
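A brief sketch of how the logistic function p = 1/[1 + exp(−(α + bX))] turns a linear index into a probability, and of how the odds respond to X. The coefficient values α = −2 and b = 0.5 are hypothetical, chosen only for illustration.

```python
import math

def logistic(z):
    """G(z) = 1 / (1 + exp(-z)); always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-z))

alpha, b = -2.0, 0.5  # hypothetical coefficients

def probability(x):
    return logistic(alpha + b * x)

def odds(x):
    p = probability(x)
    return p / (1 - p)

p_mid = probability(4)            # alpha + b*4 = 0, so p = 0.5
p_high = probability(40)          # large index: p approaches 1
p_low = probability(-40)          # very negative index: p approaches 0
odds_ratio = odds(5) / odds(4)    # effect of a one-unit increase in X on the odds
print(p_mid, p_high, p_low, odds_ratio, math.exp(b))
```

The last two printed values agree: a one-unit increase in X multiplies the odds by exp(b), whatever the starting value of X.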
93. The Probit Model
Another choice for G(z) is the standard normal
cumulative distribution function (cdf):
$G(z) = \Phi(z) \equiv \int_{-\infty}^{z} \phi(v)\,dv$, where φ(z) is the standard normal density,
so $\phi(z) = (2\pi)^{-1/2} \exp(-z^2/2)$
This case is referred to as a probit model
Since discrete choice models are nonlinear models, they
cannot be estimated by the OLS method;
we use maximum likelihood estimation
94. Probits and Logits
Both the probit and logit are nonlinear and require
maximum likelihood estimation
No real reason to prefer one over the other
Both functions have similar shapes – they are increasing
in z, most quickly around 0
Traditionally we saw more use of the logit, mainly
because the logistic function was easier to compute.
Today, probit is easy to compute with standard packages,
so is also popular
95. Interpreting Coefficients
In general we care about the effect of x on P(y = 1|x), that
is, we care about ∂p/ ∂x
For the linear case, this is easily computed as the
coefficient on x
In the case of the logit, since
p/(1-p) = exp(α + bX + e) = exp(α)·exp(bX)·exp(e),
the slope coefficient (b) is interpreted as the rate of
change in the "log odds" as X changes.
exp(b) is the odds ratio: the factor by which the odds
are multiplied when X increases by one unit.
96. The Likelihood Ratio Test
Unlike the LPM, where we can compute F statistics to test
exclusion restrictions, we need a new type of test
Maximum likelihood estimation (MLE) will always
produce a log-likelihood, L
Just as in an F test, you estimate the restricted and
unrestricted models, then form
$LR = 2(L_{ur} - L_r) \sim \chi^2_q$, where q is the number of restrictions
97. Goodness of Fit
Unlike the LPM, where we can compute an R2 to judge
goodness of fit, we need new measures of goodness of fit
One possibility is a pseudo R2 based on the log likelihood and
defined as 1 – Lur/Lr
Can also look at the percent correctly predicted.
98. Extensions
Unordered multiple (j>2) choices: travel mode, treatment
choice, etc., should be analyzed with the multinomial
logit model
Ordered multiple (j>2) choices: opinion/attitude surveys,
rankings,etc., should be analyzed with the ordered logit
model
The Tobit model is used when the dependent variable is
censored:
y* = xb + u, u|x ~ Normal(0, σ²)
we only observe y = max(0, y*)
99. Limited dependent variable models in SPSS
Analyze → Regression → choose the model of
interest from the list other than ‘Linear’
100. Analyzing qualitative Data
• There is considerable amount of interview, focus group
discussion and/or text-based data and images that require
analysis.
• Creswell (2003) suggests that it is useful to look at the
codes that have emerged according to:
Codes readers would expect to find;
Codes that are surprising; and
Codes that address a larger theoretical perspective in the research.
Then, follow the next steps
Identifying themes
Coding data (reducing data to manageable size)
Developing a description from the data
Defining themes from the data
Connecting and interrelating themes
101. Analyzing qualitative…
Further activities
Noting reflections in the margins
Sorting and sifting through the materials to identify similar
phrases, relationships, patterns, themes, commonalities, &
differences
Isolating patterns, processes, commonalities, & differences
and incorporating methods to further explore them into the
next wave of data collection
Gradually developing a small set of generalizations about
what consistently appears in the data
Confronting those generalizations with a formalized body of
knowledge in the form of constructs or theories