BRM_Data Analysis, Interpretation and Reporting Part II.ppt

Alpha University College
5/11/2022 1
Business Research Methods

Part VI (Sub-part II)
Data Analysis, Interpretation
and Reporting

Chapter Six: Data Analysis, Interpretation and Reporting
Data Management and Support Software
Descriptive Analysis
Inferential Analysis
Hypothesis Testing
Interpretation, scientific writing and reporting

Data Analysis: Introduction
 Once the data is ready for processing, the next step is to
choose an appropriate analysis method and conduct the analysis.
 Data analysis depends on the nature of the variable, the type
of data and the purpose of the analysis. The following issues
will affect the data analysis part of your research endeavor.
 The type of data you have gathered, (i.e.
Nominal/Ordinal/Interval/Ratio)
 Are the data paired such as before and after treatment?
 Are they parametric or non-parametric?
 Ranks, scores, or categories are generally non-parametric data.
 Measurements that come from a population that is normally
distributed can usually be treated as parametric.
 What are you looking for: differences, correlations, or something else?

Data Analysis: Introduction
 Simply put: Data analysis is the process of making
meaning from the data
 Broadly classified, data analysis involves:
 Quantitative analysis
 Qualitative analysis
 The quantitative analysis uses numeric
expressions/representations and manipulations of the
collected data.
 The analysis could take descriptive or inferential form.
 Based on number of variables involved, quantitative
analysis could be univariate, bivariate and/ or multivariate
analysis.

Quantitative analysis
 Descriptive vs Inferential analysis:
 Descriptive analysis: refers to statistically describing,
aggregating, and presenting the constructs of interest or
associations between these constructs.
 Inferential analysis: refers to the statistical estimation of
parameter values and testing of hypotheses (theory testing).
 With respect to the number of variables:
 Univariate analysis: only one variable is analyzed
 Bivariate analysis: two variables are analyzed
 Multivariate analysis: more than two variables are
included in the analysis process
 It also varies with the four scales of measurement

Scales of Measurement & Descriptive Statistics

Reliability Analysis/Test (SPSS)
 It helps measure consistency of an instrument.
 Internal consistency is the most commonly used measure
of reliability
 Factors that increase reliability
 Number of items
 High variation among individuals being tested
 Clear instructions
 Optimal testing situation
Analyze → Scale → Reliability Analysis → select items →
Statistics → choose statistical tests → Continue → choose
Alpha from the Model list → OK

Univariate analysis (Descriptive analysis)
• The following categories of the descriptive analysis are usually
used.
• Frequency distributions
• Measures of central tendency
• Measures of dispersion
• Shape of distribution
1) Frequency distributions (tables, bar graph, pie chart, histogram)
a) Frequency table: a table summarizing the values of a variable
and the number of times the variable assumes a given value. It
has:
• Descriptive title
• Clear labels for columns and rows
• Appropriate categories
• Presentation of frequencies and corresponding percentages

Univariate analysis…
b) Pie charts and bar charts: used when the data is nominal or
ordinal. A pie chart shows only one variable, whereas a bar
chart can show more than one.
c) Histogram: used when the data is measured at the interval
or ratio level.
 We can also have line graphs to explore the variable(s).

Univariate analysis…
• Example: Frequency table (Leisure time preference)
Preference         Frequency   Percentage   Cumulative %
With friends               9          9.0            9.0
Sport activities          30         30.0           39.0
With family               40         40.0           79.0
Reading                   21         21.0          100.0
Total                    100        100.0
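Outside SPSS, the same frequency table can be sketched in a few lines of Python. The raw responses are rebuilt from the counts in the example, and the variable names are illustrative only.

```python
from collections import Counter

# Raw responses rebuilt from the counts in the leisure-time example
responses = (["With friends"] * 9 + ["Sport activities"] * 30
             + ["With family"] * 40 + ["Reading"] * 21)

counts = Counter(responses)  # keeps categories in first-seen order
n = len(responses)

rows = []
cumulative = 0.0
for category, freq in counts.items():
    percent = 100 * freq / n
    cumulative += percent
    rows.append((category, freq, percent, cumulative))

for category, freq, percent, cum in rows:
    print(f"{category:<18}{freq:>5}{percent:>8.1f}{cum:>8.1f}")
```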

Example:
Bar Diagram: Lists the categories and presents the percent or
count of individuals who fall in each category.

Example:
Pie Chart: Lists the categories and presents the percent or
count of individuals who fall in each category.

Example:
Histogram: The overall pattern can be described by its shape,
center, and spread. The following age distribution is right
skewed, its center lies between 80 and 100, and it has no outliers.

Frequency distributions in SPSS
 Frequency tables: are found under the ‘analyze’ menu bar
(Analyze ---- Descriptive statistics ---- Frequencies)
 Then, select variables and move them to ‘variable(s)’ dialog
box, choose from the options, display frequency tables, OK
 Charts and graphs: two options
 Analyze ---- Descriptive statistics --- Frequencies --- charts
 Graphs --- Legacy dialogs --- charts/graphs (options)


Univariate analysis…
2) Measures of central tendency
 Central tendency is an estimate of the center of a
distribution of values.
 There are three major estimates of central
tendency: mean, median, and mode.

Measures of central tendency…
1. Mean
 For a data set, the mean is the sum of the values divided
by the number of values. The mean of a set of numbers
x1, x2, ..., xn is typically denoted by x̄, pronounced "x bar".
This mean is a type of arithmetic mean. The mean
describes the central location of the data; the arithmetic
mean is the "standard" average, often simply called the
"mean".
 The other name is average
 mainly for interval variables
 very widely used and intuitively appealing

2. Median
 It is the middle value of the distribution when all items are
arranged in either ascending or descending order in terms
of value
 mid-point value; arrange data from lowest to highest to
identify mid value; if two mid values, take the average
 mean is sensitive to outliers but median is robust
$$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ value}$$

3. Mode
 It is the value that occurs most frequently in the data set
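A short Python sketch of the three measures of central tendency, using hypothetical income figures chosen so that one value is an outlier:

```python
from statistics import mean, median, mode

# Hypothetical monthly incomes; the last value is an outlier
incomes = [200, 250, 250, 300, 350, 5000]

avg = mean(incomes)     # pulled upward by the outlier
mid = median(incomes)   # average of the two middle values (250 and 300)
most = mode(incomes)    # most frequent value

print(avg, mid, most)
```

The mean lands far above most of the observations, while the median stays close to the bulk of the data, which illustrates why the median is described as robust.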
3) Measures of dispersion
• It measures the amount of scatter or variation in a dataset
• Or it refers to the way values are spread around the central
tendency, for example, how tightly or how widely are the
values clustered around the mean.
• similar measures of central tendency may come from very
different distributions

Measures of dispersion...
[Figure: two distributions that have the same mean but different dispersions]

Measures of dispersion…
 Common measures of dispersion include minimum,
maximum, range, variance and standard deviation.
 But, the most frequently used in analysis are range
and standard deviation
 Range = Maximum value – Minimum value
 Range is sensitive to outliers

 Variance:
 The variance is used as a measure of how far a set of
numbers are spread out from each other. It is one of
several descriptors of a probability distribution,
describing how far the numbers lie from the mean
(expected value). In particular, the variance is one of
the moments of a distribution.
$$\mathrm{Var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

 Standard deviation:
 It is a widely used measurement of variability or diversity used
in statistics and probability theory. It shows how much variation
or “dispersion" there is from the average (mean, or expected
value). A low standard deviation indicates that the data points
tend to be very close to the mean, whereas high standard
deviation indicates that the data are spread out over a large range
of values. The standard deviation of X is given by:
$$\mathrm{SD}(x) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$
A useful property of standard deviation is that, unlike variance, it is
expressed in the same units as the data.

Measures of dispersion
 Coefficient of variation (CV):
 In probability theory and statistics, the coefficient of
variation (CV) is a normalized measure of dispersion of a
probability distribution. It is also known as unitized risk or
the variation coefficient. The coefficient of variation (CV)
is defined as the ratio of the standard deviation to the
mean:
$$\mathrm{CV} = \frac{\mathrm{SD}}{\mathrm{Mean}}$$
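The dispersion measures of this section can be sketched with the Python standard library. The data are hypothetical, and the population versions (`pvariance`, `pstdev`) are used on the assumption that the formulas here divide by n.

```python
from statistics import mean, pvariance, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements

data_range = max(data) - min(data)   # maximum minus minimum
var = pvariance(data)                # mean squared deviation (divides by n)
sd = pstdev(data)                    # square root of the variance
cv = sd / mean(data)                 # unit-free ratio of SD to mean

print(data_range, var, sd, cv)
```

Because the CV has no units, it can be used to compare the spread of variables measured on different scales.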

Measures of shape of distribution
4) Measures of shape of distribution
 skewness and kurtosis are the commonly used
measures of shape of distribution of a dataset.
 Skewness:
 It refers to symmetry or asymmetry of the
distribution.
 The skewness value can be positive or negative, or
even undefined.

Measures of shape of distribution…
 Skewness:
 Qualitatively, a negative skew indicates that the tail on the
left side of the probability density function is longer than
the right side and the bulk of the values (possibly including
the median) lie to the right of the mean.
 A positive skew indicates that the tail on the right side is
longer than the left side and the bulk of the values lie to the
left of the mean. A zero value indicates that the values are
relatively evenly distributed on both sides of the mean,
typically but not necessarily implying a symmetric
distribution.

 The skewness of a random variable X is the third
standardized moment and is defined as
$$SK = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)\,S^3}$$
where S is the standard deviation.
 The coefficient of skewness is a measure of the degree
of symmetry in the variable distribution.

 Kurtosis:
 It refers to peakedness of the distribution.
 It is a measure of the "peakedness" of the probability
distribution of a real-valued random variable.
 Higher kurtosis means more of the variance is the result of
infrequent extreme deviation, as opposed to frequent
modestly sized deviations.
$$KU = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n-1)\,S^4}$$

 The coefficient of Kurtosis is a measure for the degree of
peakedness/flatness in the variable distribution.
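The two shape coefficients can be sketched as follows. This assumes that S in the formulas is the sample standard deviation (computed with n - 1), which the slides do not state explicitly; the data are hypothetical.

```python
from statistics import mean, stdev

def skewness(xs):
    """Sum of cubed deviations divided by (n - 1) * S**3."""
    m, s, n = mean(xs), stdev(xs), len(xs)
    return sum((x - m) ** 3 for x in xs) / ((n - 1) * s ** 3)

def kurtosis(xs):
    """Sum of fourth-power deviations divided by (n - 1) * S**4."""
    m, s, n = mean(xs), stdev(xs), len(xs)
    return sum((x - m) ** 4 for x in xs) / ((n - 1) * s ** 4)

symmetric = [1, 2, 3, 4, 5]       # hypothetical, evenly spread
right_skewed = [1, 1, 2, 2, 10]   # hypothetical, long right tail

print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric), kurtosis(right_skewed))
```

The symmetric sample gives a skewness of zero, while the sample with a long right tail gives a positive value.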

Central tendency, dispersion and shape in SPSS
Analyze → Descriptive Statistics → Descriptives → Options (select the statistics of interest)

The Normal Distribution Assumption
The normal distribution is a symmetric distribution in which
cases cluster around the mean, with equal numbers on either
side. It is the most useful distribution in statistics, and has
the following important properties:
1. Symmetric and bell-shaped
2. Mode, median, and mean coincide
3. As a corollary to (1), a fixed proportion of observations
lies between the mean and fixed units of standard
deviation (for example, about 68% within one standard
deviation and about 95% within two).

Normal distribution…
 Z-Score (Standard Normal Curve) – is a normal
curve with mean = 0 and standard deviation,
S = 1. It is used to compare scores in two or more
distributions that have different means and standard
deviations.
z = (x − x̄)/s, where z is the number of standard
deviations by which x lies above or below the mean.
 If the data is normally distributed, we employ
parametric tests
 If the data is categorical or if the assumption of
normality does not hold, we use non-parametric tests
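A minimal sketch of z-scores in Python, using hypothetical scores from two tests that have different means and standard deviations:

```python
from statistics import mean, pstdev

def z_scores(xs):
    """Express each value as a number of standard deviations from the mean."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

# Hypothetical scores on two differently scaled tests
math_scores = [40, 50, 60, 70, 80]
essay_scores = [10, 12, 14, 16, 18]

z_math = z_scores(math_scores)
z_essay = z_scores(essay_scores)
print([round(z, 2) for z in z_math])
print([round(z, 2) for z in z_essay])
```

After standardization the two sets of scores are directly comparable: a student at the top of either test is the same number of standard deviations above that test's mean.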

Using histogram to test the normality of the
data

Checking for normality with a Q-Q plot

Analyze, Descriptive Statistics, Explore…

Bivariate analysis
How do we analyze relationships between two variables?
Bivariate analysis examines two variables together to see
whether they are correlated or whether values differ between
groups.
Remember: co-variation does not always imply causation.

Bivariate analysis
• Examples:
• Do men earn more income than women?
• Does educational level affect attitudes toward
participation in labour union?
• Is income level correlated with life expectancy?
• Is parental educational level correlated with student
performance?
 We need to conduct hypothesis testing to arrive at
conclusive results on issues like this.

Hypotheses Testing
 The following are the steps in hypothesis testing:
1. State the null and alternative hypotheses.
2. Choose an appropriate statistical test.
3. Specify the level of statistical significance (usually
0.1, 0.05 or 0.01), known as the α-level.
4. Compute the test statistic and decide whether to reject
or fail to reject the null hypothesis based on the findings.
 We use different tests based on the nature of the dependent
and independent variables and nature of distribution of the
data.
 During hypothesis testing, there is a possibility of
committing decision errors. There are two types of errors.

Hypothesis…
 "Type I error"
 A type one error is a false positive (true) result.
 If you use a parametric test on nonparametric data then
this could trick the test into seeing a significant effect
when there isn't one.
 Or , it is a situation where we reject the null hypothesis that
is true.
 The probability of committing Type I error is called
the significance level (α), against which the p-value is compared.
 This error requires more attention and is important to avoid.

Hypothesis…
• "Type II error"
• It occurs when we fail to reject a null hypothesis that is false.
• For example, using a nonparametric test on parametric data
could reduce the chance of seeing a significant effect when
there is one.
• A type two error is a missed opportunity, i.e. we have
failed to detect a significant effect that truly does exist.
• It is usually treated as the less dangerous of the two errors.
• Summary: using a parametric test in the wrong context
may lead to a type one error, a false positive.
• Using a nonparametric test in the wrong context may lead
to a type two error, a missed opportunity.

Hypothesis…
• Reading the p-value
• It is the basis for deciding whether or not to reject the
null hypothesis.
• P-values do not simply provide a Yes or No answer; they
provide a sense of the strength of the evidence against
the null hypothesis.
• The lower the p-value, the stronger the evidence. The null
hypothesis is usually rejected when the p-value is below
0.05 or 0.01.
• It is the probability that a statistical result as extreme as
the one observed would occur if the null hypothesis
were true.
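To illustrate what "as extreme as the one observed" means, the sketch below computes an exact two-sided p-value for a simple, hypothetical case: 15 successes in 20 trials under the null hypothesis that success and failure are equally likely. The function name and the data are illustrative assumptions, not part of the slides.

```python
from math import comb

def binomial_two_sided_p(successes, n):
    """Exact two-sided p-value for H0: success probability = 0.5."""
    # Under H0, P(X = k) = C(n, k) / 2**n
    observed_dev = abs(successes - n / 2)
    extreme = sum(comb(n, k) for k in range(n + 1)
                  if abs(k - n / 2) >= observed_dev)
    return extreme / 2 ** n

p_value = binomial_two_sided_p(15, 20)
alpha = 0.05                     # chosen significance level
reject_null = p_value < alpha    # decision rule from the steps above
print(round(p_value, 4), reject_null)
```

The p-value is the total probability of all outcomes at least as far from the expected 10 successes as the observed 15, counted in both directions.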

Hypothesis…
Parametric tests
 T-test (one sample, independent sample, paired)
 One-way ANOVA
 Repeated-measures ANOVA (for paired data)
 Pearson correlation
There are many techniques of non-parametric tests
 Chi-square for independence
 Mann-Whitney Test
 Wilcoxon Signed Rank Test
 Kruskal-Wallis Test
 Friedman Test
 Spearman Rank Order Correlation

Hypothesis…
               | Nominal | Ordinal | Interval/Ratio | Dichotomous
Nominal        | Contingency table, Chi-square, Cramer’s V | Contingency table, Chi-square, Cramer’s V | Z-test; T-test or F-test (If DV is interval/ratio) | Contingency table, Chi-square, Cramer’s V
Ordinal        | Contingency table, Chi-square, Cramer’s V | Spearman’s rho (ρ) | Spearman’s rho (ρ) | Spearman’s rho (ρ)
Interval/Ratio | Z-test; T-test; or F-test (If DV) | Spearman’s rho (ρ) | Pearson’s r | Spearman’s rho (ρ)
Dichotomous    | Contingency table, Chi-square, Cramer’s V | Spearman’s rho (ρ) | Spearman’s rho (ρ) | Phi (ɸ)

Hypothesis…
Requirement | Example of Situation | Test to be Used
Compare to a target | Is the average age of employees more than 40 years? | One-sample t-test
Compare two groups | Do men earn more income than women? | Independent samples t-test
Compare two groups with one controlled intervention | Test scores before and after training | Paired t-test
Compare more than two groups | Compare amount of income between four categories of educational level | One-way ANOVA (F-test)
Association between two categorical variables | Is there an association between gender and job grade? | Contingency table, Chi-square
Association between two quantitative variables | Is there an association between advertising and sales? | Pearson’s r

Hypothesis…
Contingency Table analysis (Cross-tabulation):
• We look for differences among categories (hence
nominal or ordinal level measurement) of the
independent variable. That is, does the IV influence the
DV?
• Contingency table (cross-tabulation): a table of
percentage distributions with the DV in rows and the IV in
columns.
• It is a bivariate frequency distribution showing the number of
cases that fall into each possible pairing of the values or
categories of the two variables.

Chi-square Test
• Chi-square test ("chi" is pronounced "ky", as in "sky"):
• It is employed to test relationships between two variables
when the data is measured at the nominal or ordinal
level.
• The chi-square test for independence can be used in
situations where you have two categorical variables.
• It works with the "simplest" form of data:
• data such as gender or country, or data that has been
placed in categories, such as age group.

Chi-square Test
• Chi-square can be calculated as follows:
$$\chi^2_c = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
• If the calculated chi-square is greater than the
chi-square obtained from the table, then we conclude
there is a relationship (that is, reject the H0).
Remember, as in all hypothesis testing, the null hypothesis
of the chi-square test is that there is no relationship between
the DV and IV.
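The calculation can be sketched in Python for a hypothetical 2 x 2 contingency table. Expected counts are taken as row total times column total divided by the grand total, which is the usual rule under independence; the counts themselves are made up.

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = DV categories, columns = IV categories
table = [[30, 10],
         [20, 40]]
chi2 = chi_square(table)
critical = 3.841   # chi-square table value for df = 1 at the 0.05 level
reject_h0 = chi2 > critical
print(round(chi2, 3), reject_h0)
```

The degrees of freedom for the table lookup are (rows - 1) x (columns - 1), which is 1 for a 2 x 2 table.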

Contingency Table and Chi-square in SPSS
• Analyze → Custom Tables → Custom Tables →
OK → Row and Column → Test Statistics → Tests
of independence (Chi-square) → OK
• Or
• Analyze → Descriptive Statistics → Crosstabs →
move the DV into Rows and the IV into Columns →
Statistics → Chi-square → OK

Comparing two groups: T-tests
• A t-test is a statistical hypothesis test in which the test statistic
follows a Student’s t-distribution if the null hypothesis is true. The
t-statistic was introduced by W. S. Gosset under the pen name “Student”.
• It is the most frequently used procedure for determining
whether or not the means of two independent groups could
conceivably have come from the same population.
• If you compute means for two samples, they will almost always
differ to some degree. The job of the t-test is to see whether they
differ by chance or whether the difference is real and reliable.
• In its one-sample form, it is given by:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

T-test in SPSS
• Parametric
• Analyze → Compare Means → One-Sample T Test or
Independent-Samples T Test or Paired-Samples T Test
• Non-parametric
• Analyze → Nonparametric Tests → Related Samples or
Independent Samples or One Sample → Automatically
compare observed data to hypothesized
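Outside SPSS, the one-sample t statistic can be sketched in a few lines of Python. The ages are hypothetical, and the target of 40 years echoes the "compare to a target" example; `one_sample_t` is an illustrative helper name.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n))."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

ages = [38, 42, 45, 41, 44, 40, 43, 39, 46]  # hypothetical employee ages
t_stat = one_sample_t(ages, 40)
df = len(ages) - 1   # degrees of freedom for the t-table lookup
print(round(t_stat, 3), df)
```

The computed value is then compared with the t-table value for the chosen α-level and these degrees of freedom.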

Comparing more than two groups: ANOVA
• ANOVA (similar to the difference of means test) is used
to examine variations among groups (and among
members within a group) with respect to some behavior
and to see if the variations are statistically significant.
• Groups may be like: male/female; economically
developed/economically developing; smokers/non-smokers;
dry-lands/wet-lands; religious/non-religious;
high/medium/low; etc.
• In ANOVA, the DV has an interval/scale measure,
while the IV has a nominal or ordinal measure.

ANOVA test
We use the F-test in ANOVA, given by
$$F_{calculated} = \frac{\text{between-group mean square}}{\text{within-group mean square}}$$
Now, if Fcalc. > Ftable, then reject the H0.
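A minimal sketch of the one-way ANOVA F statistic, assuming the standard definition (between-group mean square over within-group mean square). The three groups are hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (SS_between / (k - 1)) / (SS_within / (n - k))."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical scores
f_calc = one_way_anova_f(groups)
print(round(f_calc, 3))
```

The result is compared with the F-table value for k - 1 and n - k degrees of freedom.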

ANOVA in SPSS
 Analyze, Compare Means, One-Way ANOVA...
(Parametric test)
 Analyze, Nonparametric, such as Kruskal-Wallis
one-way non-parametric ANOVA
 Choose Post Hoc..., Post Hoc Tests, Choose Tukey

Scatterplots/diagrams
 Scatter plot/diagram:
 values of the two variables plotted on each axis
 strong relationships can be identified by scatter
diagrams
 Four relationships can be identified
 Positive linear
 Negative linear
 Non linear
 No relationship at all

Scatter plot of a positive association
[Figure: Income and livestock ownership; x-axis: Income, y-axis: Livestock]

Scatter plot of a negative association
[Figure: Income and illiteracy rates (%); x-axis: Income, y-axis: Rate of illiteracy (%)]

Scatter plot of no association
[Figure: Income and household size; x-axis: Income, y-axis: Household size]

Scatter and line graph
[Figure: four panels showing a positive linear relationship, a negative linear relationship, a relationship that is not linear, and no relationship]

Scatter plot in SPSS
• Graphs → Legacy Dialogs → Scatter/Dot

Covariance and Correlations
• The interest is in the association/relationship
between two variables, or whether they vary together.
• Examples:
• Does the income of individuals increase as age increases?
• Is the amount of sales associated with advertising
expenditure?
• Is crime related to socio-economic background?
• Is student academic achievement associated with
parents’ educational level?

Covariance
 Covariance:
 Covariance between X and Y refers to a measure of how much
two variables change together.
 Covariance indicates how two variables are related. A positive
covariance means the variables are positively related, while a
negative covariance means the variables are inversely related.
The formula for calculating covariance of sample data is
shown below.
$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n}$$

Correlation Analysis
 Correlation:
 Is concerned with the relationship/association, direction
and strength of the relationship between variables.
 Correlation coefficients can be calculated to see the
direction and strength of the relationship
 Depends on the nature of variables (parametric vs non-
parametric or numeric vs non-numeric)
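$$r(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$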

Correlation...
• The most commonly used is Pearson’s correlation coefficient,
also called Pearson’s r or simply the correlation coefficient
• Captures linear relationships between variables; non-linear
relationships are not captured
• Lies between -1 and 1
• r = 0: no linear relationship
• r = 1: perfect positive relationship
• r = -1: perfect negative relationship
• Spearman’s rho/rank correlation coefficient (ρ)
• mainly for ordinal variables (non-parametric)
• Phi (Φ): correlation between two dichotomous variables

Correlations and Covariance in SPSS
• Correlation
• Analyze → Correlate → Bivariate → Correlation
coefficients (choose depending on
parametric/nonparametric)
• Covariance
• Analyze → Correlate → Bivariate → Options → Cross-product
deviations and covariances
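The covariance and Pearson's r can also be sketched directly in Python. The advertising and sales figures are hypothetical, and the covariance divides by n on the assumption that it matches the formula given earlier.

```python
from math import sqrt
from statistics import mean

def covariance(xs, ys):
    """Average product of deviations from the two means (divides by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson_r(xs, ys):
    """Covariance scaled by the two standard deviations."""
    # covariance(v, v) is the variance of v
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

advertising = [1, 2, 3, 4, 5]   # hypothetical spending
sales = [2, 4, 6, 8, 10]        # hypothetical sales, exactly 2 x spending

cov = covariance(advertising, sales)
r = pearson_r(advertising, sales)
print(cov, r)
```

The covariance depends on the units of the two variables, whereas r is unit-free and reaches 1 here because the relationship is perfectly linear and positive.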

Regression Analysis
• Regression analysis is a set of statistical techniques using past
observations to find (or estimate) the equation that best
summarizes the relationships among key economic variables.
• The method requires that analysts:
• (1) collect data on the variables in question,
• (2) specify the form of the equation relating the variables,
• (3) estimate the equation coefficients, and
• (4) evaluate the accuracy of the equation
• Regression analysis is used to:
• Predict the value of a dependent variable based on the
value of at least one independent variable
• Explain the impact of changes in an independent variable
on the dependent variable

Regression…
 Regression Analysis is Used Primarily to Model
Causality and Provide Prediction
 Predict the values of a dependent (response) variable
based on values of at least one independent
(explanatory) variable
 Explain the effect of the independent variables on the
dependent variable
 The relationship between X and Y can be shown on a
scatter diagram

Simple Linear Regression Model
 Only one independent variable, x
 Relationship between x and y is described by a
linear function
 Changes in y are assumed to be caused by
changes in x
 Regression analysis serves three major purposes:
1. Description
2. Control
3. Prediction

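Population Linear Regression
The population regression model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where y is the dependent variable, x is the independent variable,
$\beta_0$ is the population y-intercept, $\beta_1$ is the population slope
coefficient, and $\varepsilon$ is the random error term, or residual.
$\beta_0 + \beta_1 x$ is the linear component and $\varepsilon$ is the random
error component.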

Regression…
• Explanatory and response variables are numeric
• The relationship between the mean of the response
variable and the level of the explanatory variable is
assumed to be approximately linear (straight line)
• Model:
$$Y = b_0 + b_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$$
• b1 > 0 → Positive association
• b1 < 0 → Negative association
• b1 = 0 → No association

Critical Assumptions
• The error term is normally distributed (normality).
• The error term has zero expected value or mean.
• The error term has constant variance in each time period
and for all values of X (i.e. homoscedasticity).
• The error term’s value in one time period is unrelated to its
value in any other period (no autocorrelation).
• The underlying relationship between the x variable
and the y variable is linear (linearity).

Ordinary Least Squares (OLS) Estimations
• b0 → Mean response when x = 0 (y-intercept)
• b1 → Change in mean response when x
increases by 1 unit (slope)
• b0, b1 are unknown parameters (like μ)
• b0 + b1x → Mean response when the
explanatory variable takes on the value x

Estimated Regression Model
The sample regression line provides an
estimate of the population regression line:
$$\hat{y}_i = b_0 + b_1 x$$
where $\hat{y}_i$ is the estimated (or predicted) y value, $b_0$ is the
estimate of the regression intercept, $b_1$ is the estimate of the
regression slope, and x is the independent variable.
The individual random error terms $e_i$ have a mean of zero.

Interpretation of the Slope and the Intercept
b0 is the estimated average value of y
when the value of x is zero
b1 is the estimated change in the
average value of y as a result of a
one-unit change in x
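As an illustration, the least-squares intercept and slope for one explanatory variable can be computed with the standard closed-form expressions (slope = sum of cross-deviations over sum of squared x-deviations). These expressions are not shown on the slides, and the data and the name `fit_line` are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx          # estimated change in y per one-unit change in x
    b0 = my - b1 * mx       # estimated y when x is zero
    return b0, b1

x = [1, 2, 3, 4]   # hypothetical explanatory values
y = [3, 5, 7, 9]   # hypothetical responses
b0, b1 = fit_line(x, y)
y_hat = b0 + b1 * 5   # predicted response at x = 5
print(b0, b1, y_hat)
```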

Multiple Linear Regression
 In simple linear regression we studied the relationship
between one explanatory variable and one response
variable.
 Now, we look at situations where several explanatory
variables work together to explain the response variable.

Formal Statement of the Model
• General regression model
$$Y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k + \varepsilon$$
• b0, b1, …, bk are parameters
• X1, X2, …, Xk are known constants
• ε, the error terms, are independent N(0, σ²)

Estimating the parameters of the model
 The values of the regression parameters bi are not known.
We estimate them from data.
 As in the simple linear regression case, we use the least-
squares method to fit a linear function to the data.
 The least-squares method chooses the b’s that make the
sum of squares of the residuals as small as possible.
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

Testing for Overall Significance
• Shows if Y Depends Linearly on All of the X Variables
Together as a Group
• Use F Test Statistic
• Hypotheses:
• H0: b1 = b2 = … = bk = 0 (No linear relationship)
• H1: At least one bi ≠ 0 (At least one independent variable affects Y)
• The Null Hypothesis is a Very Strong Statement
• The Null Hypothesis is Almost Always Rejected

Model Fitness Tests
Analysis of Variance and F Statistic
$$F = \frac{\text{Explained Variation}/(k-1)}{\text{Unexplained Variation}/(n-k)}$$
$$F = \frac{R^2/(k-1)}{(1-R^2)/(n-k)}$$
$$F = \frac{MSR}{MSE}$$

Test for Overall Significance
ANOVA
             df    SS          MS          F          Significance F
Regression    2    228014.6    114007.3    168.4712   1.65411E-09
Residual     12    8120.603    676.7169
Total        14    236135.2
Here k = 3 is the number of parameters, so the regression df is k - 1 = 2
and the total df is n - 1 = 14. Significance F is the p-value.

The Coefficient of Determination – R²
• The coefficient of determination is the proportion of
the total variance that is explained by the regression.
• It is the ratio of the explained sum of squares to the total
sum of squares.
$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2}$$
• The higher R² is, the closer the estimated regression equation fits the
sample data.
• Since TSS, RSS and ESS are all non-negative (being squared deviations),
and since ESS ≤ TSS, R² must lie in the interval 0 ≤ R² ≤ 1.
• A value of R² close to one shows a “good” overall fit, whereas a value
near zero shows a failure of the estimated regression equation to explain
the variation in Y.
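The sums of squares from the ANOVA table in this chapter can be used to check the R² and F formulas against each other. The sketch below only re-uses the table's figures (n = 15 observations, k = 3 parameters); the variable names are illustrative.

```python
# Figures taken from the ANOVA table in this chapter
ess = 228014.6    # explained (regression) sum of squares
rss = 8120.603    # residual sum of squares
tss = 236135.2    # total sum of squares
n, k = 15, 3      # observations and number of parameters

r_squared = ess / tss
r_squared_alt = 1 - rss / tss      # same quantity, up to rounding in the table

msr = ess / (k - 1)                # mean square due to regression
mse = rss / (n - k)                # mean square error
f_stat = msr / mse
f_from_r2 = (r_squared / (k - 1)) / ((1 - r_squared) / (n - k))

print(round(r_squared, 4), round(f_stat, 2), round(f_from_r2, 2))
```

Both routes to F agree with the value reported in the table, and R² shows that the regression explains most of the variation in Y for this example.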

Multiple regression model building
 Often we have many explanatory variables, and our goal is
to use these to explain the variation in the response
variable.
 A model using just a few of the variables often predicts
about as well as the model using all the explanatory
variables.

Linear Regression in SPSS
• Analyze → Regression → Linear → select several
options

Limited Dependent Variables
• Dichotomous variables
• Ordered choice
• Intensity measurement

Logistic regression
 There are many important research topics for which the
dependent variable is "limited."
 For example: voting, morbidity or mortality, and
participation data is not continuous or distributed
normally.
 Binary logistic regression is a type of regression analysis
where the dependent variable is a dummy variable: coded 0
(did not vote) or 1(did vote)
 Binary models
 Discrete choice models, etc.

The Linear Probability Model
The linear probability model can be written as:
• Y = α + βX + e, where Y = (0, 1), or
• P(y = 1|x) = b0 + xb
• But:
• The error terms are heteroskedastic
• e is not normally distributed because Y takes on only two
values
• The predicted probabilities can be greater than 1 or less
than 0
• An alternative is to model the probability as a function, G(b0
+ xb), where 0 < G(z) < 1

The Logit Model
• A common choice for G(z) is the logistic function, which is the
cdf for a standard logistic random variable
• G(z) = exp(z)/[1 + exp(z)] = Λ(z)
• This case is referred to as a logit model, or a logistic regression
• The estimated probability is given as:
ln[p/(1-p)] = α + bX + e, or
p = 1/[1 + exp(-α - bX)]

The Logit Model
• Where:
• p is the probability that the event Y occurs, p(Y=1)
• p/(1-p) is the odds
• ln[p/(1-p)] is the log odds, or "logit"
The logistic distribution constrains the estimated
probabilities to lie between 0 and 1.
• if α + bX = 0, then p = .50
• as α + bX gets very large, p approaches 1
• as α + bX gets very large and negative, p approaches 0
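The behaviour of the logit model can be checked numerically. In this sketch the coefficients `a` and `b` are hypothetical values chosen for illustration, not estimates from any data.

```python
from math import exp, log

def logistic(z):
    """Probability implied by a linear index z = a + b*X."""
    return 1 / (1 + exp(-z))

def logit(p):
    """Log odds of a probability p."""
    return log(p / (1 - p))

def odds(p):
    return p / (1 - p)

a, b = -2.0, 0.5   # hypothetical intercept and slope

p_at_zero_index = logistic(a + b * 4)   # index is 0, so p = 0.5
p_high = logistic(a + b * 40)           # large positive index, p near 1
p_low = logistic(a + b * -40)           # large negative index, p near 0

# A one-unit increase in X multiplies the odds by exp(b)
odds_multiplier = odds(logistic(a + b * 5)) / odds(logistic(a + b * 4))
print(p_at_zero_index, p_high, p_low, odds_multiplier, exp(b))
```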

The Probit Model
• Another choice for G(z) is the standard normal
cumulative distribution function (cdf)
• $G(z) = \Phi(z) \equiv \int_{-\infty}^{z} \phi(v)\,dv$, where $\phi(z)$ is the standard normal density,
so $\phi(z) = (2\pi)^{-1/2} \exp(-z^2/2)$
• This case is referred to as a probit model
• Since discrete choice models are nonlinear models, they
cannot be estimated by the OLS method
• We use maximum likelihood estimation

Probits and Logits
 Both the probit and logit are nonlinear and require
maximum likelihood estimation
 No real reason to prefer one over the other
 Both functions have similar shapes – they are increasing
in z, most quickly around 0
 Traditionally we saw more use of the logit, mainly
because the logistic function was easier to compute.
 Today, probit is easy to compute with standard packages,
so is also popular
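The similarity of the two link functions can be seen by evaluating both at a few values of z. The standard normal cdf is written here with the error function from Python's `math` module; the function names are illustrative.

```python
from math import erf, exp, sqrt

def logit_cdf(z):
    """Logistic function G(z) = exp(z) / (1 + exp(z))."""
    return exp(z) / (1 + exp(z))

def probit_cdf(z):
    """Standard normal cdf, expressed through the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for z in (-3, -1, 0, 1, 3):
    print(z, round(logit_cdf(z), 3), round(probit_cdf(z), 3))
```

Both functions equal 0.5 at z = 0, rise most steeply there, and flatten toward 0 and 1 in the tails; the logistic curve has somewhat heavier tails.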

Interpreting Coefficients
• In general we care about the effect of x on P(y = 1|x), that
is, we care about ∂p/∂x
• For the linear case, this is easily computed as the
coefficient on x
In the case of logit, since
p/(1-p) = exp(α + bX + e) = exp(α)·exp(bX)·exp(e):
• The slope coefficient (b) is interpreted as the rate of
change in the "log odds" as X changes
• exp(b) is the factor by which the odds are multiplied when X
increases by one unit (the "odds ratio")

The Likelihood Ratio Test
• Unlike the LPM, where we can compute F statistics to test
exclusion restrictions, we need a new type of test
• Maximum likelihood estimation (MLE) will always
produce a log-likelihood, L
• Just as in an F test, you estimate the restricted and
unrestricted model, then form
• $LR = 2(L_{ur} - L_r) \sim \chi^2_q$
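A small numerical sketch of the likelihood ratio statistic. The two log-likelihood values are hypothetical, and a single restriction (q = 1) is assumed so that the familiar 5% chi-square table value can be used.

```python
def likelihood_ratio(loglik_unrestricted, loglik_restricted):
    """LR = 2 * (L_ur - L_r); larger values favour the unrestricted model."""
    return 2 * (loglik_unrestricted - loglik_restricted)

l_ur, l_r = -120.5, -125.0   # hypothetical log-likelihoods
lr = likelihood_ratio(l_ur, l_r)

critical = 3.841             # chi-square table value for q = 1 at the 5% level
reject_restriction = lr > critical
print(lr, reject_restriction)
```

Because the unrestricted model can never fit worse than the restricted one, the statistic is never negative.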

Goodness of Fit
• Unlike the LPM, where we can compute an R² to judge
goodness of fit, we need new measures of goodness of fit
• One possibility is a pseudo R² based on the log likelihood and
defined as 1 – Lur/Lr
• Can also look at the percent correctly predicted.

Extensions
• Unordered multiple (j>2) choices: travel mode, treatment
choice, etc., should be analyzed with the multinomial
logit model
• Ordered multiple (j>2) choices: opinion/attitude surveys,
rankings, etc., should be analyzed with the ordered logit
model
• Tobit model: used when the dependent variable is
censored.
• y* = xb + u, u|x ~ Normal(0, σ²)
• we only observe y = max(0, y*)

Limited dependent variable models in SPSS
• Analyze → Regression → choose the model of
interest from the list other than ‘Linear’

Analyzing qualitative Data
• There is often a considerable amount of interview, focus group
discussion and/or text-based data and images that require
analysis.
• Creswell (2003) suggests that it is useful to look at the
codes that have emerged according to:
  • Codes readers would expect to find;
  • Codes that are surprising; and
  • Codes that address a larger theoretical perspective in the research.
• Then, follow the next steps:
  • Identifying themes
  • Coding data (reducing data to a manageable size)
  • Developing a description from the data
  • Defining themes from the data
  • Connecting and interrelating themes

Analyzing qualitative…
 Further activities
 Noting reflections in the margins
 Sorting and sifting through the materials to identify similar
phrases, relationships, patterns, themes, commonalities, &
differences
 Isolating patterns, processes, commonalities, & differences
and incorporating methods to further explore them into the
next wave of data collection
 Gradually developing a small set of generalizations about
what consistently appears in the data
 Confronting those generalizations with a formalized body of
knowledge in the form of constructs or theories
