DIFFERENCE IN MEANS AND
REGRESSIONS WITH BINARY
INDEPENDENT VARIABLES
ECON 355 – Regression Analysis
Ryan Herzog, Ph.D.
IN THIS TOPIC
• Categorical variables
• Mean-comparison tests
• Difference-in-difference(s) estimation technique
CATEGORICAL VARIABLES
• Up until now we have been dealing with continuous variables, e.g. the price of a house or
the size of a house
• Categorical variables are different: they are usually described by a word, not a number. We
can deal with them by grouping observations into categories.
• For example, if I ask whether you are a cat or a dog person and record the answers in a
spreadsheet, I will be dealing with the words “cat” and “dog”, not numbers
CLASS DATA COLLECTION
• Please use the following link to input data about yourself:
- Cat person/dog person
- Coffee consumption
https://docs.google.com/spreadsheets/d/1R7cPm92FeYuaxAQRLINh1_zCLL_VlEZwj_K9CGz1zPA/edit?usp=sharing
- Do cat and dog lovers consume the same amount of caffeine?
DIFFERENCE IN MEANS
• To test if there is a difference in means between two groups, we need to find the t-stat:
• Find the means and the difference between them
• Divide the difference by the standard error: $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
• The null hypothesis is that the means of the two groups are equal, e.g. cat and dog
lovers consume the same amount of caffeine
• The alternative is that the means of the two groups are different, e.g. cat and dog
lovers do not consume the same amount of caffeine
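The steps above can be sketched outside Stata as well. The Python sketch below is only an illustration: the function name and the coffee numbers are made up, not class data. It computes the difference in means, the standard error from the two sample variances and sample sizes, and the t-stat.

```python
import math
import statistics


def t_stat_difference_in_means(group1, group2):
    """Return (difference in means, standard error, t-stat) for two groups."""
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    # statistics.variance is the sample variance s^2 (it divides by n - 1)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    se = math.sqrt(var1 / len(group1) + var2 / len(group2))
    diff = mean1 - mean2
    return diff, se, diff / se


# Made-up cups of coffee per day, for illustration only
cat_people = [2, 3, 1, 4, 2]
dog_people = [1, 2, 2, 1, 3, 3]

diff, se, t = t_stat_difference_in_means(cat_people, dog_people)
print(f"difference = {diff:.3f}, SE = {se:.3f}, t = {t:.3f}")
```

As a rule of thumb for large samples, a t-stat larger than about 2 in absolute value rejects the null of equal means at the 5% level; with samples this small the critical value is larger.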
BINARY VARIABLES
• To record categorical variables we will use dummies/binary variables: 1 and 0
• For example, if I have two groups that are mutually exclusive, meaning each observation
can only belong to one group but not both at the same time, I will assign “1” to the first
group and “0” to the other.
• For example, cat lovers can be coded as 1, and dog lovers will then be coded as 0.
• Stata: use TeachingRatings.dta dataset
• Which variables are continuous and which ones are categorical?
DIY
• Work in Stata
• We need to find out the difference in means of student-teacher evaluations for male
and female professors
• bys female: summarize course_eval
• Use the formula for t-stat to test if the means between the two groups are equal
REGRESSION WITH A BINARY INDEPENDENT VARIABLE
• When we have a binary variable on the right-hand side we are effectively comparing
means between two groups: the included group (coded 1) and the excluded group (coded 0).
Stata: reg course_eval female
• The interpretation of beta changes
• It is no longer “when X increases by 1 unit” but rather “For the included group (what
is it?) the dependent variable is on average beta higher (or lower, if beta is negative)
than for the excluded group (what is it?)”
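To see why the regression is a comparison of means, here is a minimal Python sketch with made-up numbers (not the TeachingRatings data). The helper ols_one_regressor is a hypothetical name; it applies the textbook one-regressor OLS formulas. The slope comes out equal to the difference in group means, and the constant equals the mean of the excluded group.

```python
import statistics


def ols_one_regressor(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data: female = 1 is the included group, female = 0 the excluded group
female = [1, 1, 1, 0, 0, 0, 0]
course_eval = [3.8, 4.0, 3.6, 4.2, 4.4, 3.9, 4.3]

intercept, slope = ols_one_regressor(female, course_eval)
mean_included = statistics.mean([y for x, y in zip(female, course_eval) if x == 1])
mean_excluded = statistics.mean([y for x, y in zip(female, course_eval) if x == 0])

print(f"constant = {intercept:.3f}, mean of excluded group = {mean_excluded:.3f}")
print(f"beta = {slope:.3f}, difference in means = {mean_included - mean_excluded:.3f}")
```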
LET’S TRY MORE EXAMPLES
• Regress course evaluations on the following binary variables
• Minority (equal to 1 if the professor represents a teaching minority, 0 otherwise)
• One credit (equal to 1 if the course is a 1-credit course, 0 otherwise)
• nnenglish (equal to 1 if the professor’s native language is not English, 0 otherwise)
• Intro (equal to 1 if the course is introductory, 0 otherwise)
• Are the relationships statistically significant?
• If yes, please interpret them
CONDUCTING A MEAN-COMPARISON TEST (T-TEST) IN
STATA
• We can also find the same answer by conducting the t-test analysis in Stata
• Statistics=>summaries, tables, and tests=>classical tests of hypotheses=>t-test (mean-
comparison test)
1. Run a mean-comparison test of teaching evaluations by gender
2. Run a mean-comparison test of teaching evaluations by any other categorical
variable
CREATING A BINARY VARIABLE IN STATA
• Please use the “binarydata_stata” dataset
• Library – a family member owned a library card when the respondent was 14
• Urban – the respondent lived in an urban area at the 2002 interview
• Government – the respondent works for the government
• To create a binary variable for “library”:
gen libraryd=0
replace libraryd=1 if library=="yes"
• Do earnings of those whose family owned a library card differ from the earnings of
those whose family did not? If yes, by how much?
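The same recoding idea, sketched in Python with made-up answers. The helper to_dummy is a hypothetical name. Two assumptions differ from the Stata lines above: blank answers are kept as missing (None) rather than coded 0, and the comparison ignores case and surrounding spaces.

```python
def to_dummy(value, yes_label="yes"):
    """Return 1 for yes_label, 0 for any other answer, None if the answer is missing."""
    if value is None or value.strip() == "":
        return None  # assumption: keep missing answers missing instead of coding them 0
    return 1 if value.strip().lower() == yes_label else 0


# Made-up survey answers, for illustration only
library = ["yes", "no", "Yes ", "", "no"]
libraryd = [to_dummy(answer) for answer in library]
print(libraryd)
```

Whether missing answers should end up as 0 or stay missing is worth checking before comparing means, because it changes who is in the excluded group.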
DIY
• Please convert the government variable into a binary variable
• By conducting a mean-comparison test in Stata please pick the correct interpretation of
the test result
DIFFERENCE-IN-DIFFERENCES ESTIMATION TECHNIQUE
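• Can support a causal interpretation, provided the treatment and control groups would
have followed similar trends without the treatment
• Needs a treatment group and a control group, each observed before and after the treatment
• We can use difference-in-differences, for example, when a new policy is implemented
at a local level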
• Example: Card and Krueger (2000).
Control group: fast food stores in Eastern Pennsylvania
Treatment group: fast food stores in New Jersey
Treatment: increase in minimum wage in New Jersey on April 1, 1992
Compare employment growth in Eastern Pennsylvania and New Jersey before and after
treatment
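The before-and-after comparison boils down to simple arithmetic. Here is a minimal Python sketch; the employment figures are hypothetical placeholders, not the numbers from the study, and diff_in_diff is an illustrative name.

```python
def diff_in_diff(treat_before, treat_after, control_before, control_after):
    """Change in the treatment group minus change in the control group."""
    change_treat = treat_after - treat_before
    change_control = control_after - control_before
    return change_treat - change_control


# Hypothetical average employment per store, for illustration only
estimate = diff_in_diff(
    treat_before=20.0, treat_after=21.0,      # treatment group (e.g. New Jersey)
    control_before=23.0, control_after=21.0,  # control group (e.g. Eastern Pennsylvania)
)
print(f"difference-in-differences estimate = {estimate}")
```

The change in the control group stands in for what would have happened to the treatment group without the treatment, which is why the two groups need to be comparable.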
PAIRED T-TEST
• The paired t-test is used to determine whether the mean of a dependent variable (e.g.,
weight, anxiety level, salary, reaction time, etc.) is the same in two related groups (e.g.,
two groups of participants that are measured at two different "time points" or who
undergo two different "conditions").
• Example 1: to understand whether there was a difference in managers' salaries before and after undertaking
a PhD
• Your dependent variable would be "salary", and your two related groups would be the two
different "time points"
• Example 2: to understand whether there was a difference in smokers' daily cigarette consumption 6 weeks
after wearing nicotine patches compared with wearing patches that did not contain nicotine,
known as a "placebo"
• Your dependent variable would be "daily cigarette consumption", and your two related groups
would be the two different "conditions" participants were exposed to; that is, cigarette
consumption values after wearing "nicotine patches" (the treatment group) compared to after
wearing the "placebo" (the control group).
• Specifically, you use a paired t-test to determine whether the mean of the paired differences
is statistically significantly different from zero.
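A minimal Python sketch of the paired t-stat, using made-up salaries (in thousands) for the same five managers before and after. The function name is illustrative. The key step is that we first take the difference for each person and then test whether the mean of those differences is zero.

```python
import math
import statistics


def paired_t_stat(before, after):
    """Return (mean difference, standard error, t-stat) for paired observations."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(diffs)
    # standard error of the mean difference: sample sd of the differences / sqrt(n)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, se, mean_diff / se


# Made-up salaries for the same five managers, for illustration only
salary_before = [50, 62, 48, 55, 70]
salary_after = [54, 65, 47, 60, 76]

mean_diff, se, t = paired_t_stat(salary_before, salary_after)
print(f"mean difference = {mean_diff:.2f}, SE = {se:.3f}, t = {t:.3f}")
```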
DIFF-N-DIFF CONTINUED. ANOTHER EXAMPLE
• Richardson and Troost (2009). Different monetary policies across Federal Reserve districts
• Mississippi is divided between the 6th and 8th Federal Reserve districts
• During the Great Depression, the Atlanta Fed (6th district) increased lending by
30-40% to rescue banks from failure; the St. Louis Fed (8th district), if anything, cut lending by
10% (laissez-faire)
• Treatment group – Mississippi banks in the 6th federal reserve district
• Control group – Mississippi banks in the 8th federal reserve district
• Treatment – monetary policy during Great Depression
MONETARY POLICY DURING GREAT DEPRESSION CONT’D
• Use banks.xlsx
• The district 6 and district 8 variables give the number of banks in each district at a point in
time
• Use the filter in Excel to find the number of banks on the first of July each year
• What is the difference in the number of banks between 1929 and 1933 in the 6th district?
• What is the difference in the number of banks between 1929 and 1933 in the 8th district?
• What is the difference of the two differences?
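The filtering and the three differences can also be sketched in Python. Everything below is an assumption for illustration: the row layout (date, banks in district 6, banks in district 8) and all bank counts are made up and are not taken from banks.xlsx.

```python
from datetime import date

# Made-up rows: (date, banks in district 6, banks in district 8)
rows = [
    (date(1929, 1, 1), 140, 170),
    (date(1929, 7, 1), 141, 169),
    (date(1933, 1, 1), 108, 110),
    (date(1933, 7, 1), 105, 100),
]

# Keep only the July 1 observation of each year, keyed by year
july = {d.year: (d6, d8) for d, d6, d8 in rows if (d.month, d.day) == (7, 1)}

change_d6 = july[1933][0] - july[1929][0]  # change in the treatment district
change_d8 = july[1933][1] - july[1929][1]  # change in the control district
did = change_d6 - change_d8                # difference of the two differences

print(f"6th district: {change_d6}, 8th district: {change_d8}, diff-in-diff: {did}")
```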
DIFF-N-DIFF WATER SUPPLY AND CHOLERA EXAMPLE
• John Snow (1855) – described the relationship between water supply and cholera deaths
in London over time
• In South London, both the Southwark and Vauxhall Company and the Lambeth Company drew
water from the contaminated Thames in central London in 1849
• In 1852 the Lambeth Company started drawing water from an uncontaminated source
upstream.
• What would we expect to happen in this case?
• What is the control group? Treatment group? Treatment?
DIFF-N-DIFF WATER SUPPLY AND CHOLERA EXAMPLE
CONT’D
• Use the Cholera_deaths Excel dataset
• To conduct the diff-n-diff analysis here what should you do? What is the conclusion?
• In Stata: Statistics=>summaries, tables, and tests=>classical tests of hypotheses=>t-test
(mean-comparison test)=>paired test=> by group
• What is the conclusion based on the statistical significance of the test?
REVIEW
1. What is the difference between categorical variables and continuous variables? Please give an example
of each
2. To run a regression with a categorical variable as an independent variable what do we need to do?
3. What is the difference in the interpretation of a regression with a continuous independent variable and a
regression with a categorical independent variable?
4. How do we interpret the constant in a regression with a categorical independent variable? Please give
an example
5. What do we mean by “omitted group” when including a categorical variable in a regression? Please give
an example
6. What does conducting a mean-comparison test allow us to do?
7. How do we conduct a mean-comparison test in Stata? When do we conduct a two-sample t-test and
when do we conduct a paired t-test?
8. To be able to conduct a difference-in-differences analysis what do we need to have?
9. Intuitively, how do we conduct a difference-in-differences analysis?

Topic 4 (binary)
