2. SOME LOGISTICS
• Verify current directory path
• In Stata, type pwd to show the current directory.
• Use cd "path" to change the directory
3. NEW STATA FUNCTIONS AND OPTIONS
• preserve/restore – preserve saves a snapshot of the data you are working with; restore brings
that snapshot back, undoing any changes made to the data in between
• drop/keep – lets you drop or keep certain observations/variables
• Example:
• Work with realestate dataset
• preserve
• drop age
• restore
4. NEW STATA FUNCTIONS AND OPTIONS CONT’D
• Another example:
• preserve
• hist price
• keep if age > 100
• hist price
• restore
5. HEDONIC PRICING
• We are going to discuss how the size of the house affects the relationship between its
price and its age
• What is the relationship between the price of the house and its age in general?
• Are all the houses in our sample the same size? Let’s look at its descriptive statistics and
histogram.
• We are going to divide our data sample into 5 groups depending on the size of the
house (under 1000 sqft, 1000-2000 sqft, 2000-3000 sqft, 3000-4000 sqft, 4000-5000
sqft) and see if the relationship between price and age changes for any of these groups.
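The size-group comparison described above can be sketched in Stata as follows. This is only a minimal sketch: it assumes the realestate dataset is already in memory and that the variables are named price, age, and sqft (adjust the names if your file differs). Houses exactly on a boundary (for example, 2000 sqft) are assigned to the larger group, which is an arbitrary choice.

```stata
* Assumes the realestate data are in memory with variables price, age, sqft
summarize sqft, detail        // descriptive statistics for house size
histogram sqft                // distribution of house size

* Price-age regression for the full sample
regress price age

* Same regression within each size group
regress price age if sqft < 1000
regress price age if sqft >= 1000 & sqft < 2000
regress price age if sqft >= 2000 & sqft < 3000
regress price age if sqft >= 3000 & sqft < 4000
regress price age if sqft >= 4000 & sqft < 5000
```

Comparing the coefficient on age across the five regressions shows whether the price-age relationship differs by house size.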
6. THE SIZE OF THE HOUSE MATTERS!
• The size of the house is related not only to its price but most likely also to its
age. Intuitively, why do you think the size and the age of a house might be related?
• In general, we will want to include in the regression everything that possibly affects Y
and is correlated with X
• Do you think the number of bedrooms and bathrooms can also be related to the age of
the house and potentially affect its price?
• If so, we should probably include them in the regression too. What is the relationship
between the price of the house and its age now?
7. IN GENERAL
• Population regression will now look like this
• $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i$
• The interpretation of the betas changes slightly. Because more than one independent
variable is included, when interpreting the beta on one of them, the others are held
constant.
• i.e. $\beta_1 = \frac{\Delta Y}{\Delta X_1}$, holding everything else constant (or ceteris paribus).
• With a one-unit change in $X_1$, $Y$ will change by $\beta_1$, holding everything else constant
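A short derivation may help show where the "holding everything else constant" interpretation comes from. The two-regressor case is used here only to keep the algebra short; the same steps apply with any number of regressors. Change $X_1$ by $\Delta X_1$ while keeping $X_2$ and $u$ fixed, then subtract the original equation:

```latex
\begin{align*}
Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u \\
Y + \Delta Y &= \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 X_2 + u \\
\Delta Y &= \beta_1 \, \Delta X_1
\quad\Longrightarrow\quad
\beta_1 = \frac{\Delta Y}{\Delta X_1}
\ \text{ with } X_2 \text{ held constant}
\end{align*}
```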
8. BACK TO THE HEDONIC PRICING EXAMPLE
• Before we interpret the betas in our multiple regression, let's figure out the measurement
units for each variable
• Price, beds, baths, age, sqft
• What does the population regression look like when we regress the price of a house on its
age, square feet, number of bedrooms, and number of bathrooms?
• What does the fitted regression look like?
• Please interpret each of the betas except the constant.
9. STATA – CREATING TABLES
• To be able to compare the results of different regressions with ease we usually create tables.
• You can see an example of a table on Blackboard. We are going to try to replicate that table
in Stata
• ssc install outreg2
• Each regression has to be added to the table separately
• Stata command: outreg2 using tablename.doc
• Every new column has to be added to the already existing table
• Stata: outreg2 using tablename.doc, append
• To start over with a document of the same name:
• Stata: outreg2 using tablename.doc, replace
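The table-building steps above can be sketched as one short sequence. Stata's built-in auto dataset is used here purely as a stand-in so the example runs without the course files; the file name mytable.doc and the choice of regressors are illustrative assumptions, not part of the course table.

```stata
* Install the user-written outreg2 command (replace updates an older copy)
ssc install outreg2, replace

* Built-in example data, used here only for illustration
sysuse auto, clear

* Column 1: start a new table, overwriting any old file of that name
regress price mpg
outreg2 using mytable.doc, replace

* Column 2: add a second regression to the existing table
regress price mpg weight
outreg2 using mytable.doc, append
```

Each outreg2 call exports the most recent regression, so it has to come right after the regress command it belongs to.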
10. DIY TIME
• Please run the following regressions and create a table with the results of the regressions
• Regress price of a house on its age
• Regress price of a house on its age and size
• Regress price of a house on its age, size, number of bedrooms and number of
bathrooms
• Please make sure the table looks clean and professional.
11. T-TEST IN A MULTIPLE REGRESSION
• The significance tests do not change between single and multiple regressions
• A coefficient is still significant at
• 1% if |t-stat| > 2.58 (p-value < 0.01)
• 5% if |t-stat| > 1.96 (p-value < 0.05)
• 10% if |t-stat| > 1.65 (p-value < 0.10)
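To see where the t-statistic and p-value in the regression output come from, they can be computed by hand after any regression. The sketch below uses Stata's built-in auto dataset as a stand-in for the course data; the scalar names t_mpg and p_mpg are made up for this illustration.

```stata
sysuse auto, clear
regress price mpg weight

* t-statistic for mpg: coefficient divided by its standard error
scalar t_mpg = _b[mpg] / _se[mpg]

* Two-sided p-value from the t distribution with residual degrees of freedom
scalar p_mpg = 2 * ttail(e(df_r), abs(t_mpg))

display "t = " t_mpg "   p = " p_mpg
display "significant at 5%? (1 = yes, 0 = no): " (p_mpg < 0.05)
```

Both numbers should match the t and P>|t| columns of the regression table. Stata bases its p-values on the t distribution, so in small samples they can differ slightly from what the large-sample cutoffs on this slide imply.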
12. MORE DIY TIME
• Please use the caschool dataset
• Please run four regressions of test score on class size: (1) with no controls; (2) controlling
for total enrollment; (3) controlling for expenditure per student and average income;
(4) controlling for average income and computers per student
• We will not edit the table in the Word file; instead, we will look at the regression results in
Stata
• Please interpret one of the betas in your regressions
13. IMPERFECT MULTICOLLINEARITY
• If we include variables in a regression that are closely related to one another, the betas
on them may become statistically insignificant (because their standard errors increase)
• Example:
• Regress test scores on percent qualifying for CalWORKs,
• then regress test scores on percent qualifying for reduced-price lunch,
• then regress test scores on percent qualifying for reduced-price lunch and percent qualifying for
CalWORKs.
• What happens to the significance of the betas?
• Sometimes several variables combined may be correlated with a variable already
included, and we may never know.
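The three regressions in the example, plus two quick checks for how closely the regressors are related, might look like the sketch below. It assumes the caschool dataset is in memory and that the variables are named testscr, calw_pct, and meal_pct; these names are an assumption and should be checked with describe before running.

```stata
* Assumes caschool is in memory; the variable names below are assumptions
correlate calw_pct meal_pct          // how closely related are the two regressors?

regress testscr calw_pct
regress testscr meal_pct
regress testscr meal_pct calw_pct    // compare standard errors and t-stats with the runs above

estat vif                            // variance inflation factors for the last regression
```

A high correlation between the two regressors, or large variance inflation factors, suggests that imperfect multicollinearity is inflating the standard errors in the joint regression.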
14. IMPERFECT MULTICOLLINEARITY: WHAT TO DO AND WHAT NOT TO DO
• Do not run kitchen-sink regressions
• Concentrate on a variable of interest; the rest should be “controlled for”. Be deliberate
about the variables you add to the regression. Start with a baseline regression and then
add more, one by one or by group.
• If multicollinearity exists in your results (and it most likely does), you are erring on the
conservative side. Inflated standard errors make significance harder to find, so you are more
likely to miss a relationship that exists than to claim one that does not.
• Example: if we want to test the relationship between test scores and average income
what other variables should we control for?
15. PERFECT MULTICOLLINEARITY
• Happens when one regressor is a perfect linear function of the other regressors
• Use the teaching ratings dataset
• Create a variable equal to 1 if the professor is male
• Stata:
• generate male=0
• replace male=1 if female==0
• Regress course evaluations on the male and female dummy variables in the same
regression
• What happens? Why do you think it happens?
• This is called the dummy variable trap: we have included a dummy for every category along
with the constant. Stata will drop one of the dummies automatically; other software may not.
Remember to always omit one category and interpret the betas on the included categories
relative to the omitted category
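The dummy variable trap can be reproduced with any 0/1 variable. The sketch below uses Stata's built-in auto dataset as a stand-in for the teaching ratings data, with foreign playing the role of female and a newly created domestic playing the role of male; the name domestic is introduced only for this illustration.

```stata
sysuse auto, clear

* foreign is 1 for foreign cars and 0 for domestic cars
generate domestic = 1 - foreign

* Both dummies plus the constant are perfectly collinear:
* domestic + foreign = 1 for every observation
regress price domestic foreign

* Omitting one category fixes the problem; the coefficient on foreign
* is the difference in mean price relative to domestic cars
regress price foreign
```

In the first regression Stata reports that one of the two dummies was omitted because of collinearity; the second regression is the correctly specified version.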
16. PERFECT MULTICOLLINEARITY EXAMPLE
• Use binarydata dataset
• There are a couple of ways to create dummy variables in Stata and include them in a
regression
• The variable “ethnicity” takes three possible values in this dataset: “black”, “Hispanic”, and
“white”. We can create a dummy variable for each; for example, the Hispanic dummy will be
equal to 1 if a person is Hispanic, and 0 otherwise.
• Stata:
• tabulate ethnicity, generate(e)
• Let’s look at the three variables Stata created
• Now regress earnings on the three variables that control for ethnicity. What happens?
Why?
• Please interpret the coefficients in the above regression.
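Two ways of including the ethnicity dummies while avoiding the trap are sketched below. The sketch assumes the binarydata dataset is in memory, that the dependent variable is named earnings, and that ethnicity is stored as a string variable; all three are assumptions to check with describe. The name eth_num is made up for this example.

```stata
* Assumes binarydata is in memory with variables earnings and ethnicity
tabulate ethnicity, generate(e)     // creates e1, e2, e3, one per category
describe e1 e2 e3                   // variable labels show which category each one flags

* Way 1: include all but one dummy; the omitted category (e1 here) is the baseline
regress earnings e2 e3

* Way 2: factor-variable notation, which omits a base category automatically.
* i. needs a numeric variable, so a string variable is encoded first
* (encode will stop with an error if ethnicity is already numeric).
encode ethnicity, generate(eth_num)
regress earnings i.eth_num
```

In both versions each reported coefficient is the difference in average earnings between that category and the omitted one.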
17. PERFECT MULTICOLLINEARITY EXAMPLE, DIY.
1. Please create a set of dummy variables for the following variable: hsdropout, i.e. a
variable equal to 1 for those who dropped out of high school and 0 otherwise, and a
variable equal to 1 for those who did not drop out of high school and 0 otherwise
• Now regress EARNINGS on one of the dummy variables. Please interpret the results of
the regression.
2. Please create a set of dummy variables for the variable relationship status
• Now regress EARNINGS on the group of the dummy variables omitting one of them.
• Please interpret the results of the regression (use slido to pick the correct answer)
18. MULTIPLE REGRESSION, DIY TIME.
• Please use the EAEF22 dataset to show the relationship between one’s earnings and amount
of schooling, while controlling for other variables.
• Please run a few regressions to determine which empirical model explains the
relationship between earnings and schooling best.
• Use what you have learned in this topic to decide which variables to include
in your regressions.
• Please interpret the relationship that you found.
19. REVIEW
• Why do we need to include more than one regressor in a regression?
• How is a t-test conducted in a multiple regression?
• How do you create a table with the results of a regression in Stata?
• What is imperfect multicollinearity? Is it a problem? Should we try to avoid it?
• What is perfect multicollinearity? Is it a problem? Should we try to avoid it?
• How do you interpret results of a regression with a set of dummy variables?
Editor's Notes
In general, if we regress price on age, we will find that there is no relationship. Within each size group, the relationship between the price and the age of the house is the following:
Under 1000: 2.06
1000-2000: 2.3***
2000-3000: 3.6***
3000-4000: 2.18
4000-5000: 1.52
Show how to use “reg y x” for the first two groups, then divide the class into three groups and ask them to do it for the last three groups
Why do you think there are no subscripts “i” on betas?
Price – thousands of dollars, beds – number, baths – number, age – years, sqft – square feet