Introduction to Limited Dependent Variables
Limited Dependent Variable
--------------------------------------------------------------------------------------------------------------------
Introduction:
So far, we know how to handle linear estimation models of the type:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \varepsilon \equiv X\beta + \varepsilon$
where we sometimes had to transform or add variables to make the equation linear:
o Taking logs of Y and/or the X’s
o Adding squared terms
o Adding interactions
Then we can run our estimation, do model checking, visualize results, etc. In all these models Y, the
dependent variable, was continuous. Independent variables could be dichotomous (dummy
variables), but not the dependent variable.
However, there are models in which the dependent variable is binary (takes only two values) or
otherwise limited, i.e. the value of the response variable is restricted to a range. The dependent
variable of interest often takes one of two values. For example,
suppose you are studying car purchase decisions. You survey consumers to find out whether they have
purchased a car in the past year. Your observations can be coded 1 (if the consumer bought a car)
or 0 (if the consumer did not). When your dependent variable can take only one of two values,
you have a dichotomous dependent variable, and the resulting models are sometimes called 0/1
models. Such models fall under limited dependent variables.
We consider models of limited dependent variables in which the economic agent’s response is limited
in some way. The dependent variable, rather than being continuous on the real line (or half–line), is
restricted. In some cases, we are dealing with discrete choice: the response variable may be
restricted to a Boolean or binary choice, indicating that a particular course of action was or was not
selected. In others, it may take on only integer values, such as the number of children per family, or
the ordered values on a Likert scale. Alternatively, it may appear to be a continuous variable with a
number of responses at a threshold value. For instance, the response to the question “how many
hours did you work last week?” will be recorded as zero for the nonworking respondents. None of
these measures is amenable to being modeled by the linear regression methods that we know.
Hence, to handle such cases, we use limited dependent variable models.
Definition: A Limited Dependent Variable, Y, is defined as a dependent variable whose range is
substantively restricted.
$Y_i^* = \beta' X_i + U_i$, where $Y_i^*$ is a latent variable and $U_i \sim N(0, \sigma^2)$
$Y_i = \begin{cases} Y_i^* & \text{if } Y_i^* > 0 \\ 0 & \text{if } Y_i^* \le 0 \end{cases}$
The common cases are:
o binary: Y ϵ {0, 1}
o multinomial: Y ϵ {0, 1, 2, ..., k}
o integer: Y ϵ {0, 1, 2, ...}
o censored: Y ϵ {Y* : Y* ≥ 0}
Categorical and limited dependent variables include dependent variables that are binary,
multicategorical, ordinal, counted, censored, or from truncated populations (Long, 1997). Binary
variables have two categories used to indicate that an event has occurred or that some characteristic
is present (for example, placement in foster care coded as "yes" or "no"). Multicategorical variables
have three or more unordered categories (for example, type of foster care placement coded as "kin
care," "non-kin family care," "group home care," or "institutional care"). Ordinal variables have
ranked categories with unknown distances between adjacent categories (for example, restrictiveness
of foster care placement rated on a five-point ordered scale). Count variables indicate the number of
characteristics or events during a given period (for example, number of foster care placements while
in state custody).
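The latent-variable definition above can be illustrated with a small simulation. This is only a sketch: the parameter values, the helper name observe, and the reading of the binary case as "1 when the latent value is positive" are assumptions made for illustration, not part of the original definition.

```python
import random

def observe(y_star, kind):
    """Map a latent value y_star to the value that would be observed.

    'binary'   : 1 if the latent value is positive, else 0 (assumed coding)
    'censored' : the latent value itself if positive, else 0
    """
    if kind == "binary":
        return 1 if y_star > 0 else 0
    if kind == "censored":
        return y_star if y_star > 0 else 0.0
    raise ValueError(f"unknown kind: {kind}")

random.seed(0)
beta0, beta1, sigma = -0.5, 1.0, 1.0  # hypothetical parameter values

rows = []
for _ in range(8):
    x = random.uniform(-2, 2)
    # Latent variable: Y* = beta0 + beta1 * x + U, with U ~ N(0, sigma^2)
    y_star = beta0 + beta1 * x + random.gauss(0, sigma)
    rows.append((x, y_star, observe(y_star, "binary"), observe(y_star, "censored")))

for x, y_star, y_bin, y_cens in rows:
    print(f"x={x:6.2f}  y*={y_star:6.2f}  binary={y_bin}  censored={y_cens:5.2f}")
```

The latent value is never seen directly; only the binary or censored column would appear in a real data set.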
Based on the above cases, limited dependent variables can be classified into two categories:
o Discrete dependent variable: discrete choice models (Logit/Probit models)
o Continuous dependent variable: truncated/censored regression models
Methods used to model LDV:
There are many LDV models, depending on the nature of the limitation. Some of them are:
o 0-1 dependent variables (dummies) by Probit and Logit
o Ordered dependent variables by ordered Probit and Logit
o Categorical dependent variables (with more than two categories) by multinomial logit
o Truncated dependent variables by Heckman’s procedure
o Censored dependent variables by Tobit
o Count (integer) dependent variables by Poisson regression
o Hazard (length) dependent variables by hazard models
Why LDV Models Are Different
To correctly analyze and interpret any LDV model, it is important to understand two fundamental
differences between LDV and OLS type models.
First, LDV models are intrinsically nonlinear, which means the relationship to be estimated cannot
be written as a summation of terms, where each term is a model coefficient times a model variable.
The intrinsic nonlinearity of LDV models has two major methodological ramifications. First, an
explanatory variable’s marginal effect (the effect of a unit change in an explanatory variable on the
dependent variable) does not equal the variable’s model coefficient. Second, the value of this
marginal effect varies with the values of all model variables. These facts imply that one cannot infer
the nature of the true relationship between an explanatory variable and the dependent variable
based solely on the estimated coefficient in an LDV model.
The second fundamental difference is that most LDV models are estimated using the method of
maximum likelihood which, unlike the method of least squares, is not based on minimizing error
variance. This means there is no measure of model “fit” directly comparable to the R-square in OLS
and, as a result, model assessment is largely restricted to testing the joint significance of all model
variables as is done in OLS using an F-test of overall model significance.
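To make the first difference concrete, here is a minimal sketch for a logit model with one explanatory variable, where P(Y = 1 | x) is the logistic function of β0 + β1·x and the marginal effect is β1·P·(1 − P). The coefficient values and function names are hypothetical, chosen only for illustration.

```python
import math

def logistic(z):
    """Logistic CDF, the link function of the logit model."""
    return 1.0 / (1.0 + math.exp(-z))

def logit_marginal_effect(beta0, beta1, x):
    """Derivative of P(Y=1|x) = logistic(beta0 + beta1*x) with respect to x."""
    p = logistic(beta0 + beta1 * x)
    return beta1 * p * (1.0 - p)

beta0, beta1 = 0.0, 1.0  # hypothetical coefficients

# The coefficient is the same everywhere, but the marginal effect is not.
effects = {x: logit_marginal_effect(beta0, beta1, x) for x in (-3, -1, 0, 1, 3)}
for x, me in effects.items():
    print(f"x={x:+d}  marginal effect={me:.4f}  coefficient={beta1}")
```

The marginal effect is largest where the predicted probability is 0.5 and shrinks toward zero in the tails, so it never equals the coefficient itself in this example.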
Model Estimation:
The basic problems with OLS estimation for LDV models are:
o Heteroscedasticity in the error terms
o Predictions are not constrained to the range of actual outcomes; for example, a predicted value
can be negative when a negative number is not possible
Although it is possible to estimate a binary LDV model with OLS, the model is likely to produce point
predictions outside the unit interval [0,1]. We could arbitrarily constrain them to either 0 or 1, but
this linear probability model has other problems: the error term cannot satisfy the assumption of
homoscedasticity. For a given set of X values, there are only two possible values for the disturbance,
−Xβ and (1 − Xβ), so the disturbance follows a binomial (two-point) distribution. Given the properties
of this distribution, the variance of the disturbance process, conditioned on X, is Var(u|X) = Xβ(1 − Xβ).
There is no constraint to ensure that this quantity will be positive for arbitrary X values. Therefore,
it will rarely be productive to utilize regression with a binary response variable; we must follow a
different strategy.
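Both problems of the linear probability model can be seen on a tiny made-up data set. The sketch below fits OLS with one regressor using the textbook formulas; the data and the helper name ols_simple are assumptions for illustration.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical binary outcome: 1 for the larger x values, 0 otherwise.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [0, 0, 0, 0, 1, 1, 1]

b0, b1 = ols_simple(x, y)
fitted = [b0 + b1 * xi for xi in x]           # "probabilities" from the linear model
implied_var = [p * (1 - p) for p in fitted]   # Var(u|X) = Xb(1 - Xb)

for xi, p, v in zip(x, fitted, implied_var):
    print(f"x={xi:+d}  fitted={p:7.4f}  implied variance={v:7.4f}")
```

The fitted values at the extremes fall below 0 and above 1, and the implied variance at those points is negative, which is impossible for a real variance.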
Because of the limited ranges of the dependent variable, the standard additive normal error is not
tenable for these models. Instead we must model the probability of various discrete outcomes. LDV
models are usually estimated by maximum likelihood, given the assumed distribution of the
conditional probabilities of various outcomes.
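As a rough sketch of what maximum likelihood estimation looks like in the simplest case, the code below fits a one-regressor logit by Newton-Raphson. The data are made up, the function names are hypothetical, and a real analysis would normally rely on an established statistics package rather than hand-written iterations.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, x, y):
    """Bernoulli log-likelihood with P(Y=1|x) = logistic(b0 + b1*x)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = logistic(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def fit_logit(x, y, iterations=25):
    """Maximize the log-likelihood by Newton-Raphson, starting from (0, 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [logistic(b0 + b1 * xi) for xi in x]
        # Score vector (first derivatives of the log-likelihood).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative second derivatives).
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system information * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data without perfect separation (both outcomes occur at x = -1, 0, 1).
x = [-2, -1, -1, 0, 0, 1, 1, 2]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logit(x, y)
print(f"b0={b0_hat:.4f}  b1={b1_hat:.4f}")
print(f"log-likelihood at estimate: {log_likelihood(b0_hat, b1_hat, x, y):.4f}")
print(f"log-likelihood at (0, 0):   {log_likelihood(0.0, 0.0, x, y):.4f}")
```

Unlike the linear probability model, the fitted probabilities here always lie strictly between 0 and 1.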
Continuous limited dependent variable:
We turn now to a context where the response variable is neither binary nor necessarily integer, but
continuous and subject to truncation or censoring. To identify situations of truncated or censored
response variables, we must fully understand the context in which the data were generated. Using
such a variable as the dependent variable in a regression equation without accounting for these
qualities will give misleading results.
Truncated Regression
In a truncated regression model, the dependent variable is truncated at a certain point: observations
with values of Y beyond a threshold (above it or below it) are not included in the sample. The sample
is drawn from a subset of the population, so that only certain values are included in the sample. A
subset of observations is dropped, and only the truncated data are available for the regression.
Definition: The case where $(y_i, x_i)$ is observed only when $y_i > a$ (left truncation), or when
$y_i < b$ (right truncation), or when $c < y_i < d$ (double truncation). Example: Let $y_i$ = the profit
of the i-th firm as a percentage of assets and $x_i$ = the four-firm concentration ratio of the industry
the firm is in. Suppose only firms with positive profit rates are observed and firms with negative
profit rates are not observed. In this case $a = 0$ and we have a problem where the dependent variable
is left truncated.
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $y_i > 0$ for $i = 1, \dots, N$ and $\varepsilon_i \sim N(0, \sigma^2)$
Neglecting the truncation can lead to biased estimates of $\beta_0$ and $\beta_1$.
Truncated regression models are used for data where whole observations are missing, so that the
values of both the dependent and the independent variables are unknown.
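The bias from ignoring truncation can be seen in a small simulation. This is a sketch under assumed values (true intercept 0, true slope 1, standard normal errors, truncation point a = 0); the helper ols_simple and the sample size are choices made for illustration only.

```python
import random

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

random.seed(42)
n = 20000
x_all, y_all = [], []
for _ in range(n):
    xi = random.uniform(-2, 2)
    x_all.append(xi)
    y_all.append(0.0 + 1.0 * xi + random.gauss(0, 1))  # assumed true model

# Left truncation at a = 0: observations with y <= 0 are dropped entirely,
# so neither y nor x is available for them.
kept = [(xi, yi) for xi, yi in zip(x_all, y_all) if yi > 0]
x_tr = [xi for xi, _ in kept]
y_tr = [yi for _, yi in kept]

full = ols_simple(x_all, y_all)
trunc = ols_simple(x_tr, y_tr)
print(f"full sample      (n={n}): intercept={full[0]:.3f}  slope={full[1]:.3f}")
print(f"truncated sample (n={len(kept)}): intercept={trunc[0]:.3f}  slope={trunc[1]:.3f}")
```

On the full sample OLS recovers a slope close to the assumed value of 1, while on the truncated sample the slope is pulled well below it.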
EXAMPLES:
1. A study of the determinants of incomes of the poor. Only households with income below a certain
poverty line are part of the sample.
2. Suppose we have a sample of AIEEE rejects (those who scored below the 30th percentile). We wish
to estimate an IQ equation: AIEEE = f(education, age, socio-economic characteristics, etc.). We will
need to take into account the fact that the dependent variable is truncated.
Censored Regression
Censored variables are those whose value is known over some range but unknown beyond a certain
value because they only were recorded or collected (that is, censored) as being at or beyond that
value. For example, exact income might be recorded for those with income less than $100,000.
However, for those with income greater than or equal to $100,000, exact income might be recorded
as "greater than or equal to $100,000" rather than an exact amount. So, exact values for income are
missing above $100,000, but the potential range of income is known. Typically, censoring is used to
elicit more reliable data or, as with income, to ensure confidentiality.
In the example given, income is "censored from above." Censoring can occur from above ("right
censored," sometimes referred to as a "ceiling effect"), from below ("left censored," sometimes
referred to as a "floor effect"), or both. Censoring can occur with continuous, ordinal, or count
dependent variables.
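The income example can be written out in a few lines. The income figures below are invented for illustration, and top_code is a hypothetical helper name; the only thing taken from the text is the $100,000 ceiling.

```python
CEILING = 100_000  # censoring point from the income example

def top_code(income, ceiling=CEILING):
    """Record the exact income below the ceiling, and the ceiling itself otherwise."""
    return min(income, ceiling)

incomes = [32_000, 58_500, 99_999, 100_000, 145_000, 410_000]  # hypothetical true incomes
recorded = [top_code(v) for v in incomes]
n_censored = sum(1 for v in incomes if v >= CEILING)

true_mean = sum(incomes) / len(incomes)
recorded_mean = sum(recorded) / len(recorded)

print("recorded:", recorded)
print("censored observations:", n_censored)
print(f"true mean={true_mean:.0f}  recorded mean={recorded_mean:.0f}")
```

Every observation stays in the sample, which is what distinguishes censoring from truncation, but the recorded values pile up at the ceiling and the recorded mean understates the true mean.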
Censoring might not be apparent until after data are collected and an examination of the distribution
of the dependent variable suggests an upper or lower value at which observations are clustered. For
example, the frequency of enjoyable time spent by study participants with their children in the past
30 days might be measured using a six-point scale, ranging from never to almost every day, and
results might indicate a large cluster of observations at the upper end of the scale. If such clustering
is due to the way the data were collected or recorded, it is reasonable to treat the variable as
censored. For example, the use of "almost every day" as the upper-scale anchor is analogous to using
an upper-income category of greater than or equal to $100,000 because, conceivably, an upper
category such as "every day" might have been used.
Definition: Consider a sample of size n, where $Y^*$ is recorded only for those values of $Y^*$ greater
than a constant $C$. For those values of $Y^* \le C$ we record the value $C$. That is, the observations are
$Y_i = \begin{cases} Y_i^* & \text{if } Y_i^* > C \\ C & \text{if } Y_i^* \le C \end{cases}, \quad i = 1, 2, \dots, n$
The resulting sample $Y_1, Y_2, \dots, Y_n$ is said to be a censored sample.
The censored regression model is defined as
$Y_i = \begin{cases} \beta' X_i + U_i & \text{if } \beta' X_i + U_i > 0 \\ 0 & \text{otherwise} \end{cases}$
where β is a k×1 vector of unknown parameters, $X_i$ is a k×1 vector of known constants, and the $U_i$
are disturbances that are independently and normally distributed with mean zero and variance $\sigma^2$.
This model was first studied by Tobin (1958) and is also called the Tobit model.
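The censored regression model above leads to a likelihood with two kinds of terms: a normal density for the uncensored observations and a normal probability for the observations recorded at zero. The sketch below writes this log-likelihood for one regressor and compares it at the parameter values used to simulate the data and at naive OLS estimates. The simulated data, parameter values, and function names are assumptions for illustration; this is not a full estimation routine.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the censored term

def tobit_loglik(b0, b1, sigma, x, y):
    """Log-likelihood of Y = max(b0 + b1*x + U, 0) with U ~ N(0, sigma^2)."""
    total = 0.0
    for xi, yi in zip(x, y):
        mu = b0 + b1 * xi
        if yi > 0:
            # Uncensored: log of the normal density at the observed value.
            z = (yi - mu) / sigma
            total += -0.5 * z * z - 0.5 * math.log(2 * math.pi) - math.log(sigma)
        else:
            # Censored at 0: log of P(Y* <= 0) = Phi(-mu / sigma).
            total += math.log(max(STD_NORMAL.cdf(-mu / sigma), 1e-300))
    return total

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

random.seed(7)
TRUE_B0, TRUE_B1, TRUE_SIGMA = 0.0, 1.0, 1.0  # assumed values for the simulation
x = [random.uniform(-2, 2) for _ in range(5000)]
y = [max(TRUE_B0 + TRUE_B1 * xi + random.gauss(0, TRUE_SIGMA), 0.0) for xi in x]

# Naive OLS on the censored data, ignoring the pile-up at zero.
ols_b0, ols_b1 = ols_simple(x, y)
resid = [yi - ols_b0 - ols_b1 * xi for xi, yi in zip(x, y)]
ols_sigma = math.sqrt(sum(r * r for r in resid) / len(resid))

ll_true = tobit_loglik(TRUE_B0, TRUE_B1, TRUE_SIGMA, x, y)
ll_ols = tobit_loglik(ols_b0, ols_b1, ols_sigma, x, y)
print(f"OLS on censored data: intercept={ols_b0:.3f}  slope={ols_b1:.3f}")
print(f"Tobit log-likelihood at simulation values: {ll_true:.1f}")
print(f"Tobit log-likelihood at OLS estimates:     {ll_ols:.1f}")
```

Maximizing tobit_loglik over the three parameters, for example with a numerical optimizer, would give the Tobit estimates; the comparison here only shows that the naive OLS values fit the censored data worse under this likelihood.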