Lab 3: Set Working Directory, Scatterplots and Introduction to
Linear Regression
Chao-yo Cheng
Zsuzsanna Magyar
January 16, 2016
1 Section objectives
In this section we will use HW2.dta. This dataset is a small set of variables from the larger
“Maddison dataset” (see footnote 1). By the end, you should be comfortable using commands to import
your .dta file, making (somewhat) fancy scatterplots, and running regressions.
2 Commands
In this lab, you should become familiar with the following commands.
cd
use
regress
and
twoway scatter
lfit
scheme()
3 Set working directory and quickly running a .do file
• Log in and open Stata.
• Log in to the class website. Save the Homework 2 data in “My Documents” (or any folder
that works for you).
• Open a .do file. First set the working directory in Stata. Type:
cd "insert path address"
To get the path, right-click on “My Documents” and choose “Copy address”,
then paste it between the quotation marks after cd.
Footnote 1: For more information, see http://www.ggdc.net/maddison/maddison-project/home.htm.
• To import the data, use the use command. Type:
use HW2.dta
• Check whether or not the data has been imported properly.
• Now you know how to open your data using code. If you put the clear all command at the
top of your .do file, each run starts from a clean copy of the dataset, so you do not
have to reopen the data by hand before running the .do file again.
• To sum up, at the top of your .do file, type:
clear all
cd "path address"
use HW2.dta
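One way to check that the data loaded properly is sketched below. Because HW2.dta is not reproduced here, the example builds a tiny stand-in dataset with input; the four rows, the unit names A to D, and all values are made up for illustration. With the real file, you would run the same checking commands after use HW2.dta.

```stata
* Toy stand-in for HW2.dta: hypothetical rows with made-up values.
clear all
input str10 country gdppc_1500 gdppc_2000
"A" 400 1200
"B" 550 6000
"C" 700 21000
"D" 480 3000
end

describe          // variable names, storage types, and dataset size
count             // number of observations
list in 1/3       // print the first three rows
display c(k)      // number of variables currently in memory
display c(N)      // number of observations currently in memory
```

If describe reports zero observations or the variable names look wrong, the most likely cause is that cd points to a different folder than the one holding the file.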
Question 1. How many variables are in the dataset, and how many observations are there?
4 Scatterplots
• Everything after the “,” in a graphical command is an option. The variables being
graphed come before the comma.
• Sometimes it is nice to use a scheme for your scatterplot so it looks simpler. Here we
use scheme(s1mono). Schemes determine the overall look of a graph.
• To draw a scatterplot with observation labels and titles for the y and x axes, type:
twoway (scatter gdppc_2000 gdppc_1500, mlabel(country)), ///
scheme(s1mono) ///
ytitle("GDP per capita 2000") ///
xtitle("GDP per capita 1500") ///
title("Scatterplot of GDP per capita 1500 versus 2000")
• Add a line of best fit using the lfit command. Type:
twoway (scatter gdppc_2000 gdppc_1500, mlabel(country)) ///
(lfit gdppc_2000 gdppc_1500, color(blue)), ///
scheme(s1mono) ///
ytitle("GDP per capita 2000") ///
xtitle("GDP per capita 1500") ///
title("Scatterplot of GDP per capita 1500 versus 2000")
Question 2. What relationship does the slope of the fitted line indicate?
5 Linear regression
• The dependent variable (or outcome variable) is what we are trying to explain. It is
also called Y.
• The explanatory (or independent) variables are what we use to do the explaining. These
variables are also called predictors, because we use them to predict the dependent
variable.
• The command for a regression is regress.
• Now let’s run a regression. Note that your dependent variable always comes
before the independent variable. Keep in mind the units of each variable, since you
need them to interpret your results. Type:
regress gdppc_2000 gdppc_1500
Question 3. What is the outcome in this regression? What variable are we using to predict
the outcome? How do we interpret the coefficient?
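To see how the pieces of the regression output fit together, the sketch below runs regress on a tiny made-up dataset and reads the slope and intercept back from the stored results. The four rows and the unit names A to D are hypothetical and are not taken from HW2.dta; only the variable names follow the lab.

```stata
* Toy data only: made-up values, not taken from HW2.dta.
clear all
input str1 country gdppc_1500 gdppc_2000
"A" 400 2000
"B" 500 4000
"C" 600 5000
"D" 700 8000
end

* Dependent variable first, then the independent variable.
regress gdppc_2000 gdppc_1500

* Stored estimates: the slope on gdppc_1500 and the intercept (_cons).
display "Slope:     " _b[gdppc_1500]
display "Intercept: " _b[_cons]

* Fitted values from the regression; lfit draws the line through these points.
predict yhat, xb
list country gdppc_1500 gdppc_2000 yhat
```

The slope is read in the units of the two variables: on average, a one-unit increase in gdppc_1500 is associated with a change of _b[gdppc_1500] units in gdppc_2000.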
6 Summary statistics by continent
• In Homework 1, you calculated the mean for each region by typing three lines of code.
A quicker way to do this is to use the tabstat command with the by() option.
• Generate a single variable for continent. The straight vertical line (|) means “or.” Type:
gen continent = .
replace continent = 1 if africa == 1
replace continent = 2 if east_asia == 1 | west_asia == 1 | cent_asia == 1
replace continent = 3 if latin_america == 1
replace continent = 4 if west_europe == 1 | east_europe == 1
replace continent = 5 if country == "United States" | country == "Canada"
replace continent = 6 if country == "Australia" | country == "New Zealand"
• Let’s see what we have.
tab continent
• Now we label the new variable.
label define contname 1 "Africa" 2 "Asia" 3 "Latin America" 4 "Europe" ///
5 "North America" 6 "Australia"
label values continent contname
• Suppose we want the mean population in 1820 for each continent.
tabstat pop_1820, statistics(mean) by(continent)
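Before trusting the group means, it can help to confirm that every observation was assigned a continent. The sketch below illustrates the idea on a made-up dataset: the five rows, the unit names A to E, and all values are hypothetical, and only two of the region dummies from the lab are included to keep it short.

```stata
* Toy data only: hypothetical rows, not from HW2.dta.
clear all
input str1 country africa west_europe pop_1820
"A" 1 0 100
"B" 1 0 300
"C" 0 1 500
"D" 0 1 700
"E" 0 0 900
end

gen continent = .
replace continent = 1 if africa == 1
replace continent = 4 if west_europe == 1

* Observations that matched none of the conditions stay missing.
count if missing(continent)
list country if missing(continent)

* Several statistics at once, by group; missing groups are left out by default.
tabstat pop_1820, statistics(mean sd n) by(continent)
```

Here unit E matches neither dummy, so it stays missing and does not appear in any group row of the tabstat table.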
All tables and figures should be numbered and fully titled, e.g.
Figure 1: Histogram of average per capita income, 1975 and
2009.
All work must be concise and presented clearly. Write in
complete and correct English sentences. Do not just give a
number as an answer. Students lose points for unclear or
incomplete presentation of data and findings.
If the person reading and grading your assignment cannot
understand what you are trying to say, they cannot give you full
credit for your ideas.
For full credit, please note which Stata command you used to
obtain each part of the answer to each question.
One example of how to interpret the slope (beta) coefficient is the
following:
On average, a (unit) increase in X (your independent variable)
is associated with a SLOPE (unit) (increase/decrease) in Y
(your dependent variable).
Please fill in the terms in parentheses and in capital letters.
Of course, you may have learned slightly different language in
another class, and you are welcome to use it. But do
not forget to state the units of the variables, and by all means
include a version of the phrase "on average."
Do not forget that the intercept (alpha or constant) is the
average of Y (or the predicted Y) when X is 0.
Furthermore, if you are comfortable doing so, you can provide a
substantive interpretation of the coefficients and interpret
the standard errors and the R-squared.
Some of you may have forgotten some of this, but we
will review the OLS model in this week’s section, and
you can also look through your notes from your intro to
stats classes.
The goal of this assignment is to describe the historical growth
of population and income around the world. All tables and
figures should be numbered and titled, e.g. Figure 1: Scatterplot
of GDP per capita in 1500 and 2000.
I. The first set of questions can be answered using the dataset
hw2.dta.
1. Compare the wealth of countries in 1600 to the wealth of
countries in 2001.
Are the wealthy countries in 1600 the same as the wealthy
countries in 2001?
In order to explore this question, first generate and report a
scatter plot by country of gdp per capita in 1600 and gdp per
capita in 2001. [Hint: Label the values on the scatter plot with
country names using the mlabel() option.]
To investigate the strength of the relationship, report a
regression with gdp per capita in 2001 as the dependent variable
and gdp per capita 1600 as the regressor.
Interpret the results.
2. Report the mean gdp per capita in 1820 and 2001 by world
region. [Hint: Use the "tabstat" command with the by() option.]
What is the ratio of western Europe’s per capita gdp to the
global average per capita gdp in 1820? And in 2001?
What is western Europe’s income ratio in those years?
Which regions were the poorest in 1820 and in 2001?
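A ratio of this kind can be computed from the results that summarize leaves behind in r(). The sketch below is only an illustration on made-up data: the five rows and their values are hypothetical, gdppc_1820 is an assumed variable name, and the "global average" here is a simple unweighted mean across the rows.

```stata
* Toy data only: hypothetical values; gdppc_1820 is an assumed variable name.
clear all
input west_europe gdppc_1820
1 1200
1 1000
0 500
0 700
0 600
end

* Mean over all observations, saved before the next command overwrites r().
summarize gdppc_1820
scalar mean_all = r(mean)

* Mean for the western Europe group only.
summarize gdppc_1820 if west_europe == 1
scalar mean_we = r(mean)

display "Ratio of group mean to overall mean: " mean_we / mean_all
```

Saving r(mean) into a scalar right away matters because each new summarize replaces the previous stored results.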
3. Compare the populations of countries in 1600 to their
populations in 2001.
Report a graph and regression that demonstrates the
relationship.
How strong is the relationship? [Hint: You may want to create
new variables that are the logarithms of populations in the two
periods to improve the graphic relationship. The Stata command
log() takes the log of a value.]
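The hint about logarithms might look like the sketch below. The variable names pop_1600 and pop_2001 are assumptions about how the dataset names these columns, and the four rows are made up; one row is deliberately set to zero to show what happens when the log is undefined.

```stata
* Toy data only: hypothetical values and assumed variable names.
clear all
input pop_1600 pop_2001
100 1000
1000 50000
10000 200000
0 500
end

* log() is the natural logarithm; ln() is a synonym.
gen ln_pop_1600 = log(pop_1600)
gen ln_pop_2001 = log(pop_2001)

* log(0) is undefined, so that row gets a missing value and is
* left out of any later graph or regression that uses ln_pop_1600.
count if missing(ln_pop_1600)
list
```

When the dependent or independent variable is logged, remember to say so in the interpretation, since the units are no longer raw population counts.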
4. What is the relationship of population to wealth in 1600?
[Hint: This question is not asking you about the relations to per
capita income.] In 2001?
Generate a graphic and a statistical comparison in both periods
and summarize the comparative relationship. [Hint: For
graphical purposes, you again may want to use logarithmic
values.]
Create a third graph and regression to answer the following
question: Have countries with larger populations in 1600 done
better over the subsequent 401 years than countries which
started with small populations?
II. The final question can be answered using the dataset
korea.dta.
5. Sort the data by year and report a table with the population
and gdp per capita of South and North Korea for the dates that
have data for both countries.
Create a line graph using the following command:
twoway (line gdppc year if country=="North Korea", ///
clcolor(green) clpat(dash) ///
legend(label(1 "North Korea") label(2 "South Korea"))) ///
(line gdppc year if country=="South Korea", ///
clcolor(red) clpat(dot) clwidth(medthick))
How would you interpret the data?
What do they say about the importance of government
institutions for economic growth?
What alternative explanations can you think of? [Hint: Korea
was a single country for most of the time period of this data set,
was occupied by Japan beginning in 1905, and split into North
(communist) and South (capitalist) at the end of WWII in 1945.
See the CIA factbook for more information:
https://www.cia.gov/library/publications/the-world-factbook/geos/kn.html.]