WELCOME
1
What is Regression Analysis?
• Regression analysis is the technique of estimating the unknown value of a
dependent variable from the known value of an independent variable.
E.g.: the effect of a price increase on demand, or
the effect of changes in the money supply on the
inflation rate
2
Regression Lines
A regression line is a line that best describes the
linear relationship between the two variables.
y = a + bx
3
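[Figure: regression lines Y = a + bX (upward sloping) and Y = a - bX (downward sloping), both with intercept a; X axis horizontal, Y axis vertical]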
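A minimal sketch of how the line y = a + bx can be estimated from data, assuming that "best describes" means the ordinary least squares criterion. The function name fit_line and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    a = mean_y - b * mean_x
    return a, b


# Made-up data lying exactly on y = 2 + 3x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
a, b = fit_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")
```

A positive b gives the upward sloping line and a negative b the downward sloping one.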
Assumptions for regression
Measurement:
• All independent variables: interval/ratio/dichotomous
• Dependent variable: interval/ratio
Specification:
• Linear relationship between the dependent and
independent variables
Expected value of the error term: zero
4
Assumptions for regression (contd.)
Homoscedasticity
• Variance of the error term is constant
Normality of error
• Errors are normally distributed for each set of values of the
independent variables
Absence of multicollinearity
5
Limitations of linear regression
Violation of measurement assumptions
• Dependent variable: if it is dichotomous
E.g.: smoker and non-smoker
adopter and non-adopter
participating and non-participating
• Independent variable: if any of the IVs is dichotomous
E.g.: male and female
6
Shall we use the linear probability model (LPM)?
Yes, but…
• Non-normality of the errors Ui
• Heteroscedastic variances of the errors
• Non-fulfilment of 0 ≤ E(Yi|Xi) ≤ 1
• Questionable value of R² as a measure
of goodness of fit
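The third problem can be seen in a small sketch: fitting a straight line to a 0/1 outcome gives "probabilities" outside the unit interval. The data (years of schooling versus adoption) and the helper fit_line are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical data: x = years of schooling, y = 1 for adopters, 0 otherwise
xs = [0, 2, 4, 6, 8, 10, 12, 14]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

a, b = fit_line(xs, ys)      # linear probability model fitted by least squares
p_low = a + b * 0            # fitted "probability" at x = 0
p_high = a + b * 14          # fitted "probability" at x = 14
print(p_low, p_high)         # one value falls below 0, the other above 1
```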
7
What is the way out?
8
Presentation
on
Logit, Probit and Tobit Models
Rabeesh Kumar Verma
Roll no: 10756
Division of Agricultural Extension, ICAR-IARI
9
What is logistic regression?
• Used to analyze relationships between a dichotomous
dependent variable and metric or dichotomous
independent variables
• Combines the independent variables to estimate the
probability that a particular event will occur or not
• LR is a nonlinear regression model that forces the output
(predicted probabilities) to lie between 0 and 1
• It could be called a qualitative response/discrete choice
model in the terminology of economics
10
Assumptions:
• Does not need a linear relationship between the dependent and
independent variables
• Does not need homoscedasticity
• The error terms need to be independent
• It requires quite large sample sizes
• Absence of perfect multicollinearity
• Does not need normality, linearity, or homogeneity of variance for the
independent variables
11
12
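[Figure: cumulative distribution function (CDF); probability P from 0 to 1 on the vertical axis, horizontal axis from -∞ to +∞]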
Features of the logit model:
• As P goes from 0 to 1, the logit L goes from -∞ to +∞.
That is, although probabilities lie between 0 and 1, logits are
not so bounded.
• L is linear in X, but the probabilities themselves are not,
which is in contrast with the LPM, where probabilities
increase linearly with X.
• If L, the logit, is positive, it means that when the value of
the regressor(s) increases, the odds that the regressand
equals 1 increase, and vice versa.
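The first feature can be checked numerically. This is a small sketch using the definition L = ln(P / (1 - P)); the function names are chosen here for illustration.

```python
import math


def logit(p):
    """Log-odds L = ln(P / (1 - P)) for a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(l):
    """Map a logit back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-l))


# Probabilities near 0 give large negative logits, near 1 large positive ones
for p in (0.001, 0.25, 0.5, 0.75, 0.999):
    print(f"P = {p:5.3f}  ->  L = {logit(p):+.3f}")
```

P = 0.5 corresponds to L = 0, the point where the odds are even.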
13
Level of measurement:
• Logistic regression analysis requires that the dependent
variable be dichotomous.
• Logistic regression analysis requires that the
independent variables be metric or dichotomous.
• If an independent variable is nominal level and not
dichotomous, the logistic regression procedure in SPSS
has an option to dummy code the variable.
• If an independent variable is ordinal, we will attach the
usual caution.
14
Variables in logistic regression:
• In a typical logistic regression analysis there is always
one dichotomous dependent variable, and
• usually a set of independent variables that may be
dichotomous, quantitative, or some combination.
15
Sample size: Logit and Probit models
The minimum number of cases per independent
variable is 10
(Hosmer and Lemeshow, Applied Logistic Regression)
For example:
If we are using 8 independent variables, then the
minimum sample size should be 8 × 10 = 80
16
Logistic regression equation
ln(Pi / (1 - Pi)) = a + b1X1 + b2X2 + … + bnXn
Where,
Pi = probability of the event happening
e.g.: adoption of technology
(1 - Pi) = probability of the event not happening
e.g.: non-adoption of technology
X1, X2, …, Xn = independent variables
b1, b2, …, bn = regression coefficients
a = constant (intercept)
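Solving the logistic regression equation above for Pi shows how a fitted logit is turned back into a probability. The shorthand Z_i for the right-hand side is introduced here only for compactness.

```latex
\begin{align*}
Z_i &= a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n \\
\ln\!\left(\frac{P_i}{1 - P_i}\right) &= Z_i \\
\frac{P_i}{1 - P_i} &= e^{Z_i} \\
P_i &= \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{1}{1 + e^{-Z_i}}
\end{align*}
```

Whatever value Z_i takes, the resulting P_i stays strictly between 0 and 1.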
17
Example:
Dependent variable: Adoption / Non-adoption

Independent variable | Description | Hypothesized relation
Age | Chronological age of farmers (years) | -
Education | Number of years of formal schooling | +
Land holding | Farm size measured in acres | +
Access to training | Yes = 1 / No = 0 | +
Distance to market | In kilometers | -
Access to credit | Yes = 1 / No = 0 | +
Extension services | Yes = 1 / No = 0 | +
18
Logit in SPSS
19
Logit in SPSS contd…
20
Logit model in Stata:
logit: predicted probabilities
logit: odds ratio
Case 1:
A Logit Analysis of Bt Cotton Adoption and Assessment
of Farmers’ Training Need
Padaria et al., 2009
24
Contd… (Padaria et al., 2009)
• B: regression coefficient
Used to predict whether or not an independent variable would be significant in the
model.
• df: degrees of freedom for the Wald chi-square test
• S.E.: the standard errors associated with the coefficients
• Wald and Sig.: Wald chi-square value and two-tailed p-value used in testing the null
hypothesis that the coefficient (parameter) is 0
• Exp(B): the exponentiation of the B coefficient, which is an odds ratio. This value is
given by default because odds ratios can be easier to interpret than the coefficient
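The coefficient B, its standard error, the Wald chi-square and Exp(B) are linked by simple formulas. The sketch below assumes the usual single-coefficient Wald statistic (B / S.E.)², which has 1 degree of freedom; the input values are hypothetical and are not taken from Padaria et al. (2009).

```python
import math
from statistics import NormalDist


def wald_summary(b, se):
    """Return (Wald chi-square, two-tailed p-value, odds ratio) for one coefficient."""
    z = b / se
    wald = z ** 2                                      # chi-square with 1 degree of freedom
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))   # same p-value as the 1-df chi-square
    odds_ratio = math.exp(b)                           # Exp(B)
    return wald, p_value, odds_ratio


# Hypothetical coefficient and standard error
wald, p_value, odds_ratio = wald_summary(b=0.693, se=0.25)
print(f"Wald = {wald:.3f}, p = {p_value:.4f}, Exp(B) = {odds_ratio:.3f}")
```

With these made-up inputs, Exp(B) is close to 2: a one-unit increase in the regressor roughly doubles the odds of the event.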
25
Advantages of the logit model:
• Transforms a dichotomous dependent variable
into a continuous variable
• Results are easily interpretable
• The method is simple to analyse
• It gives parameter estimates that are asymptotically
consistent, efficient and normal, so that the analogue
of the regression t-test can be applied.
26
Limitations:
• As in the case of the linear probability model, the disturbance term in
the logit model is heteroscedastic, and therefore we should go
for weighted least squares.
• As in many other regressions, there may be a problem of
multicollinearity if the explanatory variables are related
among themselves.
27
Applications of the logit model:
1. It can be used to identify the factors that affect the adoption of a
particular technology, say, the use of new crop varieties, fertilizers,
pesticides etc. on the farm.
2. The model is used extensively to analyze growth phenomena such as
population, GNP, money supply etc.
3. In the field of marketing it can be used to study brand preferences and
brand loyalty.
4. Gender studies can use logit analysis to find the factors that
affect the decision-making status of men and women in the family.
28
Probit regression model:
• The probit model is a type of regression where the dependent
variable can take only two values, for example adoption or
non-adoption, married or not married.
• The purpose of the model is to estimate the probability
that an observation falls into one of the two categories.
• The estimating model that emerges from the normal cumulative
distribution function (CDF) is popularly known as the probit
model.
• Sometimes it is also called the normit model.
29
Probit: Level of measurement
requirements
• Dependent variable: dichotomous/categorical
• E.g.: adoption and non-adoption,
participation and non-participation
• Independent variables: metric or dichotomous
• E.g.: age (ratio-level data)
• Gender: male/female (dichotomous)
30
Case 2 : Factors Affecting Adoption of Improved Rice Varieties
among Rural Farm Households in Central Nepal
Ghimire (2015), published in Rice Science
31
Probit result cont…
32
Difference between
Logit and Probit models:

Logit | Probit
Slightly flatter tails | The conditional probability Pi approaches 0 or 1 at a faster rate
Based on the standard logistic distribution | Based on the standard normal distribution
Variance = π²/3 | Variance = 1
Simple mathematics | Sophisticated mathematics

Both give similar results; the choice of method depends on the researcher's preference,
but logit regression is mostly preferred.
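The difference in the tails can be illustrated by evaluating both CDFs at the same number of standard deviations from zero. This is a sketch using only the Python standard library; the helper name logistic_cdf is chosen here for illustration.

```python
import math
from statistics import NormalDist


def logistic_cdf(z):
    """CDF of the standard logistic distribution (basis of the logit model)."""
    return 1.0 / (1.0 + math.exp(-z))


normal_cdf = NormalDist().cdf   # CDF of the standard normal (basis of the probit model)

# The standard logistic distribution has variance pi^2 / 3
logistic_sd = math.pi / math.sqrt(3.0)

# Compare the two curves k standard deviations away from the centre
for k in (-3, -2, -1, 0, 1, 2, 3):
    lp = logistic_cdf(k * logistic_sd)
    pp = normal_cdf(k)
    print(f"{k:+d} sd   logit: {lp:.4f}   probit: {pp:.4f}")
```

Far from the centre the logistic probabilities stay further from 0 and 1 than the normal ones, which is what "flatter tails" refers to.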
33
Significance of the Wald test
• Tests the statistical significance of the unique
contribution of each coefficient in the
model
• This test is similar to the t-test in
multiple regression
34
Ordinal logit & probit models
• Both are used when the outcome has more than 2 categories
that are ordinal in nature
• The dependent variable:
• E.g. 1: Likert-type scale: strongly agree, somewhat agree, strongly
disagree
• E.g. 2: less than high school (0), high school (1), college (2),
postgraduate (3)
• The independent variables remain the same as in the logit and probit
models
35
Multinomial logit and multinomial
probit
• Used when the dependent variable is not ordinal in nature and
it has more than 2 categories.
• E.g. 1: adoption of different adaptation strategies
Dependent variable = choice of transportation to work
• E.g. 2: occupation classification: unskilled, semiskilled, highly
skilled
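A sketch of how a multinomial logit turns one linear index per category into choice probabilities. It assumes the usual form P_j = exp(V_j) / Σ_k exp(V_k), which the slides do not spell out; the transport modes and index values below are hypothetical.

```python
import math


def multinomial_logit_probs(indices):
    """Map one linear index per category to probabilities that sum to 1."""
    m = max(indices.values())                       # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in indices.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Hypothetical indices for three unordered choices; "bus" is the base category (index 0)
indices = {"bus": 0.0, "car": 0.8, "bicycle": -0.5}
probs = multinomial_logit_probs(indices)
for mode, p in probs.items():
    print(f"{mode:8s} {p:.3f}")
```

With only two categories this reduces to the binary logit formula.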
36
Multinomial logit model
Kassie et al., 2008
Dependent variable: compost, conservation tillage, both
We have three categories, i.e. more than 2 categories
37
Tobit model
• An extension of the probit model.
• Developed by James Tobin (Nobel laureate
economist)
• Used for a sample in which information on the
regressand is available only for some observations.
• Such a sample is called a censored sample.
• Therefore the Tobit model is also known as the censored
regression model.
• Sometimes also called a limited dependent variable
regression model
38
Contd…
• Example:
• Suppose we have a set of consumers and we are
interested in finding out the amount of money a
person or family spends on a house in relation to
socioeconomic variables.
• Here we have a dilemma…
• If a consumer does not purchase a house, obviously
we have no data on housing expenditure for that
consumer; we have such data only for the
consumers who actually purchase a house.
39
• Thus, consumers are divided into two groups consisting
of, say, n1 and n2 consumers
• n1: consumers about whom we have information on the regressors
(say income, number of people, mortgage interest rate) as
well as the regressand (amount of expenditure on a house)
• n2: consumers about whom we have information only on the
regressors but not on the regressand.
• Now a question arises:
40
• Can we estimate the regression using only the n1
observations and not worry about the remaining n2
observations?
• The answer is no.
• The OLS estimates of the parameters obtained
from the subset of n1 observations will be biased as
well as inconsistent.
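The bias from using only the n1 observations can be seen in a small simulation. This is a sketch under made-up assumptions: a "true" line with intercept -1 and slope 1, standard normal errors, and a regressand that is observed only when it is positive. The helper fit_line is hypothetical.

```python
import random


def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


random.seed(0)
TRUE_INTERCEPT, TRUE_SLOPE = -1.0, 1.0       # hypothetical "true" parameters

xs, ys = [], []
for _ in range(5000):
    x = random.uniform(0.0, 4.0)
    y_star = TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0.0, 1.0)
    xs.append(x)
    ys.append(y_star)

# n1 group: observations whose regressand is actually observed (positive values)
n1 = [(x, y) for x, y in zip(xs, ys) if y > 0]
a_n1, b_n1 = fit_line([x for x, _ in n1], [y for _, y in n1])

# Benchmark using all latent values, which would not be available in practice
a_all, b_all = fit_line(xs, ys)

print(f"n1 only: intercept {a_n1:.2f}, slope {b_n1:.2f}")
print(f"all:     intercept {a_all:.2f}, slope {b_all:.2f}")
```

The slope estimated from the n1 group alone is pulled toward zero, and a larger sample does not remove the gap.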
41
• Statistically, we can express the tobit model as
  Yi = β1 + β2Xi + Ui   if RHS > 0
     = 0                otherwise
• where RHS = right-hand side
• Note: additional X variables can easily be added to
the model.
42
• Truncated sample:
• To be distinguished from a censored sample.
• In a truncated sample, information on the regressors
(IV) is available only if the regressand (DV) is
observed.
43
• If we estimate a regression line based on the n1
observations only, the resulting intercept and slope
coefficients are bound to be different from what they would be if all the
(n1 + n2) observations were taken into account.
44
Mechanics of estimating the tobit
model:
• Tobit models are estimated by the method of maximum
likelihood (ML).
• James Heckman has proposed an alternative to ML
which is comparatively easy.
• The Heckman procedure yields consistent estimates
of the parameters, but they are not as efficient as the
ML estimates.
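For reference, the log-likelihood that ML maximizes in the standard tobit model can be written as below. This assumes normally distributed errors U_i with mean zero and variance σ²; the symbols σ, φ (standard normal density) and Φ (standard normal CDF) are introduced here and do not appear in the slides.

```latex
\ln L =
  \sum_{Y_i > 0} \left[ -\ln\sigma
    + \ln\phi\!\left( \frac{Y_i - \beta_1 - \beta_2 X_i}{\sigma} \right) \right]
  + \sum_{Y_i = 0} \ln\!\left[ 1 - \Phi\!\left( \frac{\beta_1 + \beta_2 X_i}{\sigma} \right) \right]
```

The first sum covers the observations with an observed regressand, and the second covers the censored observations, which contribute only the probability of being censored.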
45
Nested regression analysis
• A nested model is one in which you incrementally
add variables such that every subsequent model is a
superset of the preceding one.
• For example, if y = a + bx is the first model, then
the second model would be something like y = a +
bx + cz +....
• The advantage of this set-up is that it allows you to
compare different specifications and ultimately
investigate the relative importance of specific
variables.
46
• Note that a model is nested if and only if the next
model contains exactly the same terms as the
preceding one and has at least one additional term.
• On the other hand, a two-stage model is one in
which two equations are estimated one after the
other, with the second-stage equation including a
predicted value (usually the predicted outcome or
residuals) from the first-stage equation.
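The nesting condition is a proper-subset test on the terms of the two models. In this small sketch, representing each model as a list of term labels is an assumption made only for illustration.

```python
def is_nested(smaller, larger):
    """True if `larger` keeps every term of `smaller` and adds at least one more."""
    return set(smaller) < set(larger)   # proper-subset comparison


model_1 = ["a", "bx"]           # y = a + bx
model_2 = ["a", "bx", "cz"]     # y = a + bx + cz
model_3 = ["a", "cz"]           # drops the bx term

print(is_nested(model_1, model_2))   # model_2 extends model_1
print(is_nested(model_1, model_3))   # model_3 does not contain bx
```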
47
Conclusion
• Be clear about the assumptions of the different
regression analysis models.
• Researchers should be well aware of the different models
and use them according to the defined research problem.
• Logit and probit models are used extensively in
health science and in the behavioral and social sciences.
• These models are used extensively in social research when
the dependent variable is dichotomous.
48
References:
• Meyers, L.S., Gamst, G., & Guarino, A.J.
(2006). Applied Multivariate Research: Design and
Interpretation.
• Padaria et al. (2009). A Logit Analysis of Bt Cotton
Adoption and Assessment of Farmers’ Training Need.
Indian Res. J. Ext. Edu. 9(2).
• Damodar et al. (2012). Basic Econometrics. McGraw
Hill Education, India.
49
Thank you…
50

Logit and Probit and Tobit model: Basic Introduction
