A PRIMER ON CAUSALITY
Marc F. Bellemare∗
Introduction
This is the second of two handouts written to help students understand quantitative methods in the social
sciences. This handout discusses some of the ways in which one can identify causal
relationships in the social sciences. In keeping with the notation introduced in the handout on linear
regression, let $D$ be our variable of interest; $y$ be an outcome of interest; and the vector $x = (x_1, \ldots, x_K)$
represent other factors – or control variables – for which we have data. For the purposes of this discussion,
let $D$ measure a given policy, $y$ measure welfare, and the vector $x$ measure the various control variables the
researcher has seen fit to include. See my “A Primer on Linear Regression” for a more basic handout.
Mechanics
Recall that the regression of $y$ on $(D, x_1, \ldots, x_K)$ is written as
$$y_i = \alpha + \beta_1 x_{1i} + \cdots + \beta_K x_{Ki} + \gamma D_i + \epsilon_i, \qquad (1)$$
where $i$ denotes a unit of observation. In the example of wages and education, the unit of observation would
be an individual, but units of observation can be individuals, households, plots, firms, villages, communities,
countries, etc. Just as the research question should drive the choice of what to measure for $y$, $D$, and $x$, the
research question also drives the choice of the relevant unit of observation.
The problem is that unless the researcher runs an experiment in which she randomly assigns the level of $D$ to
each unit of observation $i$, the estimated relationship from $D$ to $y$ will not be causal. That is, $\gamma$ will not truly capture the
impact of $D$ on $y$, as it will be “contaminated” by the presence of omitted factors. Some of those factors
can be included in $x = (x_1, \ldots, x_K)$, of course, but it is in general impossible to fully control for every relevant
factor. This is especially true when unobservable or costly-to-observe factors (e.g., risk aversion, technical
ability, soil quality, etc.) play an important role in determining $D$ and $y$. So even if we get an estimate of $\gamma$
that is statistically significant, we cannot necessarily assume that the relationship between the variable of
interest and the outcome variable is causal. In other words, correlation does not imply causation.
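To see how an unobserved factor contaminates the estimate of $\gamma$, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the variable `u` stands in for an unobservable, the coefficients 0.8 and 1.5 are made up, and the true effect of `D` on `y` is set to zero so that any estimated slope is pure contamination.

```python
import numpy as np


def simulate_confounding(n=20_000, true_gamma=0.0, seed=0):
    """Return (naive_slope, controlled_slope) from two regressions of y on D."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                 # unobserved factor (hypothetical)
    D = 0.8 * u + rng.normal(size=n)       # the variable of interest depends on u
    y = true_gamma * D + 1.5 * u + rng.normal(size=n)

    # Naive regression: y on a constant and D, with u left in the error term.
    X_naive = np.column_stack([np.ones(n), D])
    naive_slope = np.linalg.lstsq(X_naive, y, rcond=None)[0][1]

    # Infeasible regression: also control for u (possible only in a simulation).
    X_full = np.column_stack([np.ones(n), D, u])
    controlled_slope = np.linalg.lstsq(X_full, y, rcond=None)[0][1]
    return naive_slope, controlled_slope


naive_gamma, controlled_gamma = simulate_confounding()
print(f"naive estimate of gamma:      {naive_gamma:.3f}")
print(f"estimate controlling for u:   {controlled_gamma:.3f}")
```

Under these made-up numbers the naive slope is well away from zero even though $D$ has no effect on $y$, while the regression that controls for `u` recovers a slope close to zero.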
For example, suppose $D$ is an individual’s consumption of orange juice and $y$ is some indicator of health.
We have often discussed in lecture how a simple regression of $y$ on $D$ would provide us with a biased
estimate of $\gamma$ because orange juice consumption is nonrandom and not exogenous to health. That is, there
are factors other than orange juice consumption which determine health. Some are observable (e.g., how
much someone exercises; whether they smoke; their diet; etc.), but several are unobservable (e.g., their
willingness to pay for orange juice; their subjective valuation of health; their level of risk aversion; their
genes; etc.). Thus, it really isn’t sufficient to run a kitchen-sink
regression (i.e., a regression in which
everything observable is thrown in as a control) to properly
identify the causal impact of $D$ on $y$.
∗ Associate Professor, Department of Applied Economics, and
Director, Center for International Food and
Agricultural Policy, University of Minnesota, 1994 Buford Ave,
Saint Paul, MN 55113, [email protected]. This is
the August 2017 version of this handout.
Identification
So how do we identify causality? The best way to do so is to
run a randomized controlled trial (RCT), which
we have discussed in lecture. In this case, the idea would be to
get a random sample of individuals of size $N$
and to assign half of the sample (i.e., $N/2$) to a control group
and half to a treatment group. The treatment group
would be told to consume, say, one glass of orange juice every
morning, and the control group would be told not
to do so. Then, after a suitable period of time, we would
compare the mean of $y$ between groups. The null
hypothesis would of course be that the mean health of the
treatment group is equal to the mean health of
the control group. A rejection of the null in favor of finding that
the mean health of the treatment group is
higher than the mean health of the control group would then be
evidence in favor of the hypothesis that
orange juice is good for one’s health. More than that – it would
be evidence that orange juice
consumption causes good health.
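The comparison of means described above can be sketched as follows. This is an illustration on simulated data, not an analysis of any real trial: the sample size, the true effect of 0.5, and the function name `difference_in_means` are all assumptions, and the test statistic is the usual unequal-variance (Welch) ratio of the difference to its standard error.

```python
import numpy as np


def difference_in_means(y_treat, y_control):
    """Return (difference, standard error, t-statistic) for two groups."""
    diff = y_treat.mean() - y_control.mean()
    se = np.sqrt(y_treat.var(ddof=1) / len(y_treat)
                 + y_control.var(ddof=1) / len(y_control))
    return diff, se, diff / se


rng = np.random.default_rng(1)
N = 2000                                    # hypothetical sample size
treated = rng.permutation(N) < N // 2       # randomly assign exactly N/2 units
true_effect = 0.5                           # made-up effect of orange juice
health = rng.normal(size=N) + true_effect * treated

diff, se, t_stat = difference_in_means(health[treated], health[~treated])
print(f"difference in means: {diff:.3f} (s.e. {se:.3f}, t = {t_stat:.2f})")
```

Because assignment is random, the two groups differ on average only in their treatment status, so the difference in means is an estimate of the causal effect; a t-statistic beyond roughly 1.96 in absolute value would reject the null of equal means at the 5 percent level.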
The problem is that it is not always possible to run an RCT, and
even the simple example described above
would be subject to important problems. For example, the
individuals in the treatment group may not
comply with the experimenter’s instructions, especially if they
don’t like orange juice. More generally, they
may simply forget to consume orange juice every morning.
Likewise, the individuals in the control group may
end up inadvertently consuming orange juice when they are not
supposed to. These reasons – and others –
would contaminate one’s estimate of $\gamma$ in equation 1 and
would invalidate the test of equality of means
described above. So what is one to do?
Instrumental Variables Estimation
When one only has observational (i.e., nonexperimental) data at
one’s disposal, the best way to identify
causality is to find an instrumental variable (IV) for the
endogenous variable. In the example above, the
endogenous variable is $D$, which is said to be endogenous to
$y$.
What is an IV? It is a variable $z$ that is (i) correlated with
$D$; but (ii) uncorrelated with $\epsilon$ and which is
used to make $D$ exogenous to $y$. How does an IV exogenize
an endogenous variable? By virtue of being
correlated with the endogenous variable, yet uncorrelated with
the error term, which is the definition of
an instrument.
I realize that this sounds tautological, so consider an example. Angrist
(1990) studies the impact of education ($D$)
on wages ($y$). The problem is that education is endogenous to
wages, if only because people acquire
education in expectation of the wage they think this will get
them. In other words, even if we find a
positive coefficient for education in a regression of wage on
explanatory variables, this is merely a
correlation, and it does not necessarily indicate that education
causally affects wages.
To instrument for education, Angrist had to find a variable that would
be correlated with how much education
someone would get, but uncorrelated with anything unobserved,
and that would affect wages only through
how much education they acquire. The instrument he settled
upon was an individual’s Vietnam draft
lottery number: it correlates with whether one goes to
war and is then subject to the GI Bill, and
because those numbers are randomly generated, they are
uncorrelated with unobservables.
How does IV estimation work, mechanically speaking? Recall
that our equation of interest is
$$y_i = \alpha + \beta_1 x_{1i} + \cdots + \beta_K x_{Ki} + \gamma D_i + \epsilon_i. \qquad (1)$$
The way IV estimation proceeds is to first regress the
endogenous variable $D$ on the instrument $z$ as well as
on the control variables in $x = (x_1, \ldots, x_K)$, such that
$$D_i = \delta + \zeta_1 x_{1i} + \cdots + \zeta_K x_{Ki} + \theta z_i + \nu_i. \qquad (2)$$
Once equation 2 is estimated, it is possible to predict the
variable $D$, whose prediction we label $\hat{D}$ (the
circumflex accent – or “hat” – denotes a predicted variable in
econometrics) and to then estimate equation 1
as follows
$$y_i = \alpha + \beta_1 x_{1i} + \cdots + \beta_K x_{Ki} + \gamma \hat{D}_i + \epsilon_i. \qquad (1')$$
Note what has been done here: we have replaced the endogenous
variable with an exogenized version of the
same variable. It has been exogenized by
regressing it on the IV, which is exogenous to the
outcome of interest, and obtaining its predicted value, which we
then use in lieu of the original endogenous
variable.
The first requirement of an instrument – i.e., that it be
correlated with $D$ – is easily testable: we only need
to check that the coefficient on the instrument $z$ in equation 2 is significantly
different from zero. The second
requirement of an instrument – i.e., that it only affect the
outcome of interest $y$ through the treatment
variable – cannot be tested for. Rather, one must make the case
that it is truly exogenous to the outcome of
interest. This is easier said than done in most cases, as some
people have devoted entire careers to finding
good IVs.
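The two steps of IV estimation described above can be sketched by hand on simulated data. This is only an illustration under stated assumptions: the data-generating numbers are made up, `u` plays the role of an unobservable that makes $D$ endogenous, `z` is constructed to be a valid instrument, and the helper names `ols` and `two_stage_least_squares` are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def two_stage_least_squares(y, D, z, x):
    """Return (IV estimate of gamma, first-stage coefficient on z)."""
    const = np.ones(len(y))
    # First stage: regress D on the instrument z and the control x.
    first_X = np.column_stack([const, x, z])
    first_coef = ols(first_X, D)
    D_hat = first_X @ first_coef            # predicted ("exogenized") D
    # Second stage: regress y on the control and D_hat in lieu of D.
    second_X = np.column_stack([const, x, D_hat])
    return ols(second_X, y)[-1], first_coef[-1]


rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(size=n)                      # observed control
u = rng.normal(size=n)                      # unobserved factor in the error term
z = rng.normal(size=n)                      # instrument: moves D, unrelated to u
D = 0.5 * x + 1.0 * z + u + rng.normal(size=n)
y = 1.0 + 0.3 * x + 2.0 * D + 2.0 * u + rng.normal(size=n)   # true gamma = 2

gamma_iv, first_stage_coef = two_stage_least_squares(y, D, z, x)
gamma_ols = ols(np.column_stack([np.ones(n), x, D]), y)[-1]
print(f"OLS estimate of gamma: {gamma_ols:.3f}")
print(f"IV estimate of gamma:  {gamma_iv:.3f}")
```

In this simulation the OLS estimate overstates the true value of 2 because $D$ is correlated with `u`, while the IV estimate is close to 2. In practice one would use a dedicated IV routine rather than two hand-run regressions, because the standard errors from a hand-run second stage ignore the fact that $\hat{D}$ is itself estimated.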
References
Angrist, Joshua D. (1990), “Lifetime Earnings and the Vietnam
Era Draft Lottery: Evidence from the Social
Security Administrative Records,” American Economic Review
80(3): 313-336.
Therapeutic Communication
1. In the movie Shutter Island, Dr. Crawley asked Teddy
(Andrew) to explain what happened when
he discovered his deceased children. What therapeutic
communication technique is this an example of?
a. Closed ended questions
b. Using silence
c. Giving broad opening
d. Open ended questions
Rationale: Open ended questions is correct because this is
giving Andrew a chance to answer with
more than a simple yes or no. Using silence is incorrect because
he is not being quiet. Closed ended
questions is incorrect because he cannot answer the question
with just a yes or no answer. Giving broad
opening is incorrect because Dr. Crawley did not ask Andrew to
pick the topic and express his thoughts.
2. Dr. Crawley told Andrew if he had another episode, he would
be lobotomized. In the ending of
the movie, Teddy called Dr. Sheehan, Chuck. Dr. Sheehan nods
his head at Dr. Crawley giving a signal.
What therapeutic communication is this?
a. Making observations
b. Presenting reality
c. Offering self
d. Accepting
Rationale: Making observations is correct because Dr. Sheehan
observed that Teddy called him Chuck,
verifying that he was having another episode. Presenting reality
is incorrect because he was not trying to
bring him back into reality. Offering self is incorrect because
the doctors are not “interested” in Teddy.
Accepting is incorrect because they are not accepting Teddy’s
behavior as normal.
3. In the movie, Shutter Island, the staff of Ashecliffe allowed
Andrew to play out the role of Teddy
hoping to cure his conspiracy insanity. This is an example of
what kind of therapeutic communication?
a. Accepting
b. Exploring
c. Engaging into fantasy
d. Denial
Rationale: Engaging into fantasy is correct because the staff
played into his fantasy of finding Andrew
Laeddis. Exploring is incorrect because exploring is delving
further into a subject, idea, experience, or
relationship. Denial is incorrect because no one was in denial
about his state of mind. Accepting is
incorrect because they know this act is only to cure his disease.
4. Which question shows an example of the therapeutic
communication placing events in
sequence?
a. “You feel angry when he doesn’t help.”
b. “Are you feeling…”
c. “What could you do to let your anger out harmlessly?”
d. “Will you please tell me more about the situation with all the
details?”
Rationale: “Will you please tell me more about the situation
with all the details” is an example of placing
events in sequence because it will tell you more about the
situation and you can piece together when
they happened. The others are incorrect because “You feel
angry when he doesn’t help” is an example
of focusing, “Are you feeling…” is an example of verbalizing
the implied and, “What could you do to let
your anger out harmlessly?” is an example of formulating a plan
of action.
5. Which of the following statements by Andrew Laeddis shows
the therapeutic communication of
understanding?
a. I feel ok.
b. This is a very difficult situation.
c. Yes, I understand that Teddy Daniels does not exist.
d. You are not listening to me.
Rationale: “Yes, I understand that Teddy Daniels does not
exist” is an example of understanding because
it conveys an attitude of receptivity and regard. The other
statements “I feel ok”, “This is a very difficult
situation”, and “You are not listening to me” are incorrect
because they are responses to implied
questions.
Defense Mechanism
1. What defense mechanism does Teddy exhibit in the movie
Shutter Island?
a. Sublimation
b. Rationalization
c. Projection
d. Fantasy
Rationale: Teddy is exhibiting fantasy because he is gratifying
frustrated desires by imaginary
achievements. In the movie Teddy is creating his own fantasy
world where he is still a U.S. Marshal, and
he creates his own story of what happened to his wife because
he does not want to face the reality of
what really happened. Sublimation is incorrect because he is
not channeling an unacceptable impulse
in a socially acceptable direction. Rationalization is incorrect
because he is not trying to justify attitudes,
beliefs, or behaviors. Projection is incorrect because he is not
attributing his own unacceptable behavior
onto someone else.
NUR 114 Nursing Concept II
TYPES OF QUESTIONS
• Open questions
These are useful in getting another person to speak. They often
begin with the words: What, Why,
When, Who
Sometimes they are statements: “tell me about”, “give me
examples of”.
They can provide you with a good deal of information.
• Closed questions
These are questions that require a yes or no answer and are
useful for checking facts. They should
be used with care - too many closed questions can cause
frustration and shut down conversation.
• Specific questions
These are used to determine facts. For example “How much did
you spend on that”
• Probing questions
These check for more detail or clarification. Probing questions
allow you to explore specific areas.
However, be careful because they can easily make people feel
they are being interrogated.
• Hypothetical questions
These pose a theoretical situation in the future. For example,
“What would you do if…?” These
can be used to get others to think of new situations. They can
also be used in interviews to find
out how people might cope with new situations.
• Reflective questions
You can use these to reflect back what you think a speaker has
said, to check understanding. You
can also reflect the speaker’s feelings, which is useful in
dealing with angry or difficult people and
for defusing emotional situations.
• Leading questions.
These are used to gain acceptance of your view – they are not
useful in providing honest views
and opinions. If you say to someone ‘you will be able to cope,
won’t you?’ they may not like to
disagree.
You can use a series of different types of questions to “funnel”
information. This is a way of
structuring information in sequence to explore a topic and to get
to the heart of the issues. You may use
an open question, followed by a probing question, then a
specific question and a reflective question.
A PRIMER ON LINEAR REGRESSION
Marc F. Bellemare∗
Introduction
This set of lecture notes was written to allow you to understand
the classical linear regression model, which
is one of the most common tools of statistical analysis in the
social sciences. Among other things, a regression
allows the researcher to estimate the impact of a variable of
interest $D$ on an outcome of interest $y$ holding
other included factors $x = (x_1, \ldots, x_K)$ constant. For
the purposes of this discussion, let $D$ measure a given
policy, $y$ measure welfare, and the vector $x$ measure the
various control variables the researcher has seen fit
to include.
Example
For example, one might be interested in the impact of
individuals’ years of education $D$ on their wage $y$ while
controlling for age, gender, race, state, sector of employment,
etc. in $x$. Generally, social science research is
interested in the impact of a specific variable of interest on an
outcome of interest, i.e., in the impact of $D$ on
$y$.
Mechanics
The regression of $y$ on $(D, x_1, \ldots, x_K)$ is typically
written as
$$y_i = \alpha + \beta_1 x_{1i} + \cdots + \beta_K x_{Ki} + \gamma D_i + \epsilon_i, \qquad (1)$$
where $i$ denotes a unit of observation. In the example of wages
and education, the unit of observation would
be an individual, but units of observation can be individuals,
households, plots, firms, villages, communities,
countries, etc. Just as the research question should drive the
choice of what to measure for $y$, $D$, and $x$, the
research question also drives the choice of the relevant unit of
observation.
When estimating equation 1, the researcher will have data on
$N$ units of observation, so $i = 1, \ldots, N$.
Alternatively, we say that $N$ is the sample size. For each of
those $N$ units, the researcher will have data on $y$,
$D$, and $x$. In other words, we will ignore the problem of
missing data, as observations with missing data are
usually dropped by most statistical packages.
The role of regression analysis is to estimate the coefficients
$(\alpha, \beta_1, \ldots, \beta_K, \gamma)$. To differentiate the “true”
coefficients from coefficient estimates, we will use a circumflex
accent (i.e., a “hat”) to denote estimated
coefficients. Therefore, the estimated $(\alpha, \beta_1, \ldots, \beta_K, \gamma)$
will be denoted $(\hat{\alpha}, \hat{\beta}_1, \ldots, \hat{\beta}_K, \hat{\gamma})$.
Going back to our interest in estimating the impact of $D$ on
$y$ at the margin, this impact is represented in the
context of equation 1 by the parameter $\gamma$. Indeed, if you
remember your partial derivatives, the marginal
impact of $D$ on $y$ is equal to $\partial y / \partial D = \gamma$. Moreover, the partial derivative is such that only $D$
varies. In other
words, $\gamma$ measures the impact of a change in $D$ on $y$
holding everything else constant, or ceteris paribus. In
this case, what we mean by “everything else” is limited only to
the factors that are included in the vector $x$ of
control variables. Whatever is not included among the variables
$x$ is not held constant by regression analysis.
Indeed, the relationship in equation 1 is not deterministic in the
sense that even if we have data for $y$, $D$, and
$x$ and credible parameter estimates $(\hat{\alpha}, \hat{\beta}_1, \ldots, \hat{\beta}_K, \hat{\gamma})$,
we will still not be able to perfectly forecast
$y$. That
is because there are several things about any given problem that
we, as social science researchers, do not
observe and are not privy to. Individuals have intrinsic
motivations that even they may have difficulty
expressing. Individuals make errors. Individuals experience
unforeseen events. There are factors which are
very important in determining $y$ but which we simply do not
observe.
For all these reasons, we add an error term $\epsilon$ at the end of
equation 1. The error term simply represents our
ignorance about the problem. As such, it includes all of the
things that we did not think of including on the
right-hand side of equation 1, as well as all of the things that we
could not include on the right-hand side of
equation 1. The error term $\epsilon$ thus embodies our ignorance
about the relationship between two variables.
So how does a linear regression actually work? To take an
example I know well, suppose we are looking at
only two variables: rice yield (i.e., kg/are), which will be our
outcome of interest $y$ since it represents
agricultural productivity, and cultivated area (number of ares,
i.e., hundredths of a hectare, or 100 square
meters each), which will be our variable of interest $D$. Indeed, the
inverse relationship between farm or plot size
and productivity has been a longstanding empirical puzzle in
development microeconomics (Barrett et al.,
2010). So, plotting some data on this question, we get the
following figure.
The scatter plot in figure 1 directly shows that the relationship
between yield and cultivated area is not
deterministic. That is, the relationship between the two
variables is not a straight line, and the fact that the
relationship is scattered indicates that there are other factors
besides cultivated area that contributed to
determining rice productivity.
The role of the regression – and, as we will soon understand, of
the error term – is to linearly approximate as
best as possible the relationship between two variables. In other
words, to do something that looks like the
red line in figure 2.
[Figure 1. Rice Productivity Scatter. Vertical axis: Yield; horizontal axis: Cultivated Area.]
[Figure 2. Rice Productivity Scatter and Regression Line. Vertical axis: Yield and fitted values; horizontal axis: Cultivated Area.]
Note that we indeed find an inverse relationship between plot
size and productivity, since the regression line
slopes downward, which means that in this context, $\hat{\gamma} < 0$.
Indeed, running a simple regression of rice yield
on cultivated area yields $\hat{\alpha} = 4.187$ (and we can see from the
graph that 4.187 would indeed be the value of
rice productivity at a cultivated area of zero ares) and $\hat{\gamma} = -0.356$,
with both coefficients statistically
different from zero at less than the 1 percent level (i.e., if $\alpha$ and $\gamma$
were truly zero, estimates this far from zero would arise less than
one percent of the time). In other words, the finding here is that
on average,
$$y = 4.187 - 0.356 D, \qquad (2)$$
or for every 1 percent increase in cultivated area, rice
productivity decreases by 0.356 percent. This may
seem counterintuitive, but remember that productivity is not
total output – it is only a measure of average
productivity on the plot.
So how does the apparatus of the linear regression determine
the value of the intercept and the value of the
slope of the regression line in figure 2? This is where our error
term comes into play. Indeed, a linear
regression will choose, among all possible lines, the one that
minimizes the sum of the distances between
each point in the scatter and the line itself, under the
assumption that the error is on average equal to zero
(i.e., that our predictions are right on average). Assuming that
the error term is equal to zero on average and
minimizing the sum of all point-line distances (technically, the
sum of squared errors) allows us to obtain
estimates $\hat{\alpha}$ and $\hat{\gamma}$ of the true parameters $\alpha$ and $\gamma$.
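For the two-variable case, the line that minimizes the sum of squared errors has a well-known closed form, which the sketch below computes directly. The data are simulated, not the rice data: I generate 466 made-up points around the line $y = 4.187 - 0.356 D$ purely for illustration, and the function name `simple_ols` is hypothetical.

```python
import numpy as np


def simple_ols(D, y):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    D_bar, y_bar = D.mean(), y.mean()
    slope = np.sum((D - D_bar) * (y - y_bar)) / np.sum((D - D_bar) ** 2)
    intercept = y_bar - slope * D_bar
    return intercept, slope


rng = np.random.default_rng(3)
area = rng.uniform(-2, 6, size=466)                       # made-up regressor
yield_ = 4.187 - 0.356 * area + rng.normal(scale=0.5, size=466)

alpha_hat, gamma_hat = simple_ols(area, yield_)
residuals = yield_ - (alpha_hat + gamma_hat * area)
print(f"alpha_hat = {alpha_hat:.3f}, gamma_hat = {gamma_hat:.3f}")
print(f"mean residual = {residuals.mean():.2e}")
```

Two things are worth noticing: the estimates are close to, but not exactly equal to, the values used to generate the data, and the residuals average out to zero by construction, which is the sense in which the fitted line is right on average.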
A few remarks are in order. First off, note that the constant term
(or the intercept) $\alpha$ does not have an
economic interpretation in this case, since a cultivated area of
zero really entails a yield of zero. Second, since
the only factor we included on the right-hand side of equation 1
was cultivated area, the error term includes
a lot of things which may be potentially crucial in determining
yield. For example, the plot’s position on the
toposequence, the quality of the soil, the source of irrigation of
the plot, various characteristics of the
household operating the plot, etc. So because we are typically
interested in the impact of $D$ on $y$ controlling
for a number of factors $x$, regression results will not be
presented in the form of figure 2. Indeed, regression
results are typically presented in the form of table 1 at the end
of this document.
How do we interpret table 1? First off, note that N = 466. That
is, we have data on 466 plots. The first column
tells us what variables are included on the right-hand side of
equation 1, viz. cultivated area; land value; total
land owned by the household; household size (number of
individuals); household dependency ratio
(proportion of dependents within the household); whether the
household head is a single female or a single
male; whether the plot is irrigated by a dam, a spring, or
rainfed; soil quality measurements (carbon,
nitrogen, potassium percentages; soil pH; clay, silt, and sand
percentages); and an intercept. The second
column shows the estimated coefficients for the first
specification of equation 1 (in this case, a pooled
cross-section of all the plots and all the households, i.e., a
specification which ignores the fact that some
households own more than just one plot in the sample); and the
third column shows the standard errors
around each estimated coefficient.
These standard errors are used to determine whether each
coefficient is statistically significantly different
from zero or not. To make life simpler, table 1 shows whether
coefficients are significant at the 10, 5, or 1
percent levels by using the symbols *, **, and *** respectively.
Note that in all cases, there is a (significant)
inverse relationship between cultivated area and rice
productivity.
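As a rough illustration of how the stars in table 1 relate to a coefficient and its standard error, the sketch below divides one by the other and compares the result with the usual large-sample critical values (1.645, 1.960, and 2.576). Using normal critical values is an assumption on my part; the significance levels reported in the table may rest on a different reference distribution. The function name `significance_stars` is hypothetical, and the inputs are coefficients and standard errors from column 1 of table 1.

```python
def significance_stars(coef, std_err):
    """Return '*', '**', '***' or '' using large-sample normal critical values."""
    t_stat = abs(coef / std_err)
    if t_stat >= 2.576:      # 1 percent level (two-sided)
        return "***"
    if t_stat >= 1.960:      # 5 percent level
        return "**"
    if t_stat >= 1.645:      # 10 percent level
        return "*"
    return ""


# Coefficient and standard error pairs from column 1 of table 1.
column_1 = {
    "Cultivated Area": (-0.271, 0.038),
    "Land Value": (0.183, 0.031),
    "Irrigated by Dam": (0.389, 0.171),
    "Irrigated by Rain": (0.184, 0.180),
}
for name, (coef, std_err) in column_1.items():
    print(f"{name}: t = {coef / std_err:.2f} {significance_stars(coef, std_err)}")
```

The rule of thumb behind this is that a coefficient roughly twice as large as its standard error (in absolute value) is significant at the 5 percent level.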
Taking column 1 as an example of how to interpret regression
results, what can we say? First and foremost,
note that for a 1 percent increase in cultivated area, there is an
associated productivity decrease of 0.27
percent (alternatively, a doubling of the size of the plot would
be associated with a 27-percent decrease in
productivity). Moreover, we can note three things. First off, the
more valuable a plot, the more productive it
is; second, plots irrigated by a dam are more productive than
plots without any irrigation; and third, plots
irrigated by a spring are more productive than plots without any
irrigation. In fact, comparing the magnitude
of the coefficient estimates for irrigation by a dam and
irrigation by a spring, we see that the impacts of these
two types of irrigation are essentially the same.
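A small worked example of the percent interpretation may help. It assumes, which the handout does not state explicitly, that both yield and cultivated area enter the regression in logarithms, so that the coefficient on cultivated area is a constant elasticity. Under that assumption the "0.27 percent for 1 percent" reading is accurate for small changes, but for a change as large as a doubling the exact figure differs from the linear approximation. The function name is hypothetical.

```python
def percent_change_in_y(elasticity, percent_change_in_D):
    """Exact percent change in y implied by a constant-elasticity (log-log) model."""
    return ((1 + percent_change_in_D / 100) ** elasticity - 1) * 100


elasticity = -0.271   # coefficient on cultivated area, column 1 of table 1

small = percent_change_in_y(elasticity, 1)      # a 1 percent increase in area
doubling = percent_change_in_y(elasticity, 100) # a 100 percent increase in area
print(f"1 percent increase in area: {small:.3f} percent change in yield")
print(f"doubling of area:           {doubling:.1f} percent change in yield")
```

Under the log-log assumption, a doubling of plot size is associated with a decrease in productivity of roughly 17 percent, somewhat less than what one obtains by multiplying the elasticity by 100.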
Another thing of note in table 1 is how the coefficient on the
variable of interest changes depending on what
is included on the right-hand side of equation 1. Comparing the
first two specifications (i.e., pooled cross-
section vs. household fixed effects), note how the magnitude of
the inverse relationship between
productivity and cultivated area is reduced from -0.271 to
-0.176 when household fixed effects (i.e., controls
for household-specific unobservable characteristics, which is
made possible here because there are 286
households for 466 plots; in other words, there are some
households who own more than one plot in the
sample) are included. This indicates that a great deal of the
inverse relationship can be attributed to
household-specific, otherwise unobservable factors. Likewise,
comparing specifications 1 and 3 (i.e., pooled
cross-section vs. soil quality), note again how the magnitude of
the inverse relationship between productivity
and cultivated area is reduced from -0.271 to -0.265 when soil
quality measurements are included. Overall,
this indicates that household-specific, unobserved factors are
more important in driving the inverse
relationship than the omission of soil quality measurements. In
any event, a comparison of specifications 1
and 2 and of specifications 1 and 3 points to an important
endogeneity problem (in this case, an omitted
variables problem) caused by the omission, respectively, of
household fixed effects and of soil quality
measurements.
References
Barrett, Christopher B., Marc F. Bellemare, and Janet Y. Hou
(2010), “Reconsidering Conventional
Explanations of the Inverse Productivity–Size Relationship,”
World Development 38(1): 88-97.
Table 1 – Yield Approach Estimation Results (n=466)
Dependent Variable: Rice Yield (Kilograms/Are)
Each cell reports the coefficient with its standard error in parentheses.

| Variable | (1) Pooled Cross-Section | (2) Household Fixed Effects | (3) Soil Quality | (4) Household Fixed Effects and Soil Quality |
|---|---|---|---|---|
| Cultivated Area | -0.271*** (0.038) | -0.176*** (0.046) | -0.265*** (0.048) | -0.187*** (0.052) |
| Total Land Area | -0.055 (0.038) | | -0.054 (0.047) | |
| Land Value | 0.183*** (0.031) | 0.303*** (0.069) | 0.176*** (0.032) | 0.287*** (0.063) |
| Household Characteristics | | | | |
| Household Size | -0.007 (0.009) | | -0.008 (0.008) | |
| Dependency Ratio | -0.073 (0.130) | | -0.083 (0.144) | |
| Single Female | -0.056 (0.111) | | -0.070 (0.119) | |
| Single Male | 0.133 (0.134) | | 0.122 (0.155) | |
| Plot Characteristics | | | | |
| Irrigated by Dam | 0.389** (0.171) | 0.228 (0.202) | 0.450** (0.211) | 0.545 (0.402) |
| Irrigated by Spring | 0.365** (0.175) | 0.250 (0.214) | 0.425** (0.214) | 0.541 (0.389) |
| Irrigated by Rain | 0.184 (0.180) | 0.024 (0.217) | 0.249 (0.220) | 0.313 (0.431) |
| Soil Quality Measurements | | | | |
| Carbon | | | -1.361 (1.510) | -0.001 (1.844) |
| Nitrogen | | | 1.668 (1.781) | -0.007 (2.750) |
| pH | | | -1.064 (7.163) | -17.969 (14.459) |
| Potassium | | | 1.183 (1.412) | -5.528* (3.035) |
| Clay | | | 0.293 (3.115) | -5.183 (4.174) |
| Silt | | | 0.521 (5.751) | 4.485 (11.681) |
| Sand | | | -0.135 (5.607) | 5.261 (7.106) |
| Intercept | -1.847*** (0.372) | -3.162*** (0.694) | -2.276 (1.552) | -3.328*** (0.819) |
| Number of Households | – | 286 | – | 286 |
| Bootstrap Replications | – | – | 500 | 500 |
| Village Fixed Effects | Yes | Dropped | Yes | Dropped |
| R2 | 0.45 | 0.97 | 0.46 | 0.97 |
| p-value (All Coefficients) | 0.00 | 0.00 | 0.00 | 0.00 |
| p-value (Fixed Effects) | – | 0.00 | – | 0.00 |
| p-value (Soil Quality) | – | – | 0.79 | 0.52 |

***, ** and * indicate statistical significance at the one, five
and ten percent levels, respectively.
Ordinary Least-Squares
One-dimensional regression
Find a line that represents the “best” linear relationship:
$$y = ax$$
• Problem: the data does not go through a line. For each observation $i$, the error is
$$e_i = y_i - x_i a$$
• Find the line that minimizes the sum:
$$\sum_i (y_i - x_i a)^2$$
• We are looking for the $\hat{a}$ that minimizes
$$e(a) = \sum_i (y_i - x_i a)^2$$
Multidimensional linear regression
Using a model with $m$ parameters
$$y = a_1 x_1 + \cdots + a_m x_m = \sum_j a_j x_j$$
and $n$ measurements
$$e(a) = \sum_{i=1}^{n} \Big[ y_i - \sum_{j=1}^{m} x_{i,j} a_j \Big]^2 = \lVert y - Xa \rVert^2$$
where $X$ is the $n \times m$ matrix with entries $x_{i,j}$.
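The multidimensional case can be sketched numerically as follows. This is an illustration on simulated data under my own assumptions: `X` is an $n \times m$ matrix of measurements, `a_true` is a made-up parameter vector, and the function name `least_squares` is hypothetical. The minimizer of the sum of squared errors solves the normal equations $(X^\top X) a = X^\top y$, which the function solves directly; the result is then checked against NumPy's built-in least-squares routine.

```python
import numpy as np


def least_squares(X, y):
    """Parameters minimizing ||y - X a||^2, obtained from the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)


rng = np.random.default_rng(4)
n, m = 200, 3                                  # n measurements, m parameters
X = rng.normal(size=(n, m))
a_true = np.array([1.0, -2.0, 0.5])            # made-up parameters
y = X @ a_true + rng.normal(scale=0.1, size=n)

a_hat = least_squares(X, y)
sum_squared_errors = np.sum((y - X @ a_hat) ** 2)
a_check = np.linalg.lstsq(X, y, rcond=None)[0]
print("estimated parameters:", np.round(a_hat, 3))
print("sum of squared errors:", round(float(sum_squared_errors), 3))
```

Solving the normal equations directly is fine for a small, well-conditioned example like this one; library routines such as `np.linalg.lstsq` are the safer choice when the columns of `X` are nearly collinear.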