4.7 Running a regression

Regression is a flexible tool for modeling the relationship between a dependent variable and one or more independent variables. Consider data from the following simulated data-generating process (DGP).

set.seed(1254)
n <- 2000
female <- rbinom(n, 1, 0.5)
educ <- sample(seq(0, 16, 1), n, replace = TRUE)
income <- 20000 + (4000 * educ) - (2500 * female) + (500 * female * educ) + rnorm(n, mean = 0, sd = 2500)

dat <- data.frame(educ = educ, income = income, female = female)

# Preview the first six rows
head(dat)
##   educ   income female
## 1   15 80649.27      0
## 2   13 68828.91      0
## 3    8 56338.62      1
## 4   16 84110.95      0
## 5    5 38053.19      0
## 6    0 17776.13      1

Suppose you want to investigate whether the relationship between education and earnings varies by gender. More specifically, you want to evaluate whether an additional year of education has a different return on earnings for females than for males. How do you do this?

You would set up a null and an alternative hypothesis, then test whether the data are consistent with the null.

Null hypothesis. Returns to education do not vary by gender.

Alternative hypothesis. Returns to education differ between females and males.

Let’s start with a simple regression specification.

\[\begin{equation} earnings_i = \alpha + \beta education_i + \epsilon_i \end{equation}\]

This specification needs to be modified in order to account for gender.

\[\begin{equation} earnings_i = \alpha + \beta_1 education_i + \beta_2 gender_i + \epsilon_i \end{equation}\]

Here, \(gender_i\) is a binary variable that takes the value \(0\) for males and \(1\) for females. The coefficient \(\beta_1\) captures the effect of one additional year of education on earnings, and \(\beta_2\) captures the difference in average earnings between females and males, holding education fixed. However, this specification forces the return to education to be the same for both groups, so it cannot test our alternative hypothesis that returns to education differ by gender. To test this, we need to add an interaction term between education and gender.
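A quick way to see this limitation is to fit the additive model and compare its single education slope with the slopes from separate fits for males and females. The block below is a minimal sketch; the names `reg_additive`, `slope_common`, `slope_male`, and `slope_female` are illustrative and do not appear elsewhere in this section.

```r
# Re-create the simulated data from above so the block runs on its own
set.seed(1254)
n <- 2000
female <- rbinom(n, 1, 0.5)
educ <- sample(seq(0, 16, 1), n, replace = TRUE)
income <- 20000 + (4000 * educ) - (2500 * female) + (500 * female * educ) +
  rnorm(n, mean = 0, sd = 2500)
dat <- data.frame(educ = educ, income = income, female = female)

# Additive model: one educ slope shared by both groups
reg_additive <- lm(income ~ educ + female, data = dat)
slope_common <- unname(coef(reg_additive)["educ"])

# Separate fits by gender: each group gets its own educ slope
slope_male <- unname(
  coef(lm(income ~ educ, data = dat, subset = female == 0))["educ"]
)
slope_female <- unname(
  coef(lm(income ~ educ, data = dat, subset = female == 1))["educ"]
)

print(c(common = slope_common, male = slope_male, female = slope_female))
```

The additive model reports a single slope that lies between the two group-specific slopes, so it cannot tell us whether those slopes differ.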

\[\begin{equation} earnings_i = \alpha + \beta_1 education_i + \beta_2 gender_i + \beta_3 gender_i \times education_i + \epsilon_i \end{equation}\]

Let’s break down what the coefficients are capturing:

  • \(\alpha\): captures the average earnings for males with zero years of education.

  • \(\beta_1\): captures the effect of one additional year of education on earnings for males.

  • \(\beta_2\): captures the difference in average earnings between females and males at zero years of education.

  • \(\beta_3\): tells us whether the effect of an additional year of education is different for females than for males. To see this, compare expected earnings at a given level of education for:

    1. male: \(\alpha + \beta_1 education_i\).

    2. female: \((\alpha + \beta_2) + (\beta_1 + \beta_3) education_i\), so \(\beta_3\) is the difference between the female and male slopes.
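In terms of these coefficients, the hypotheses can be written formally. This is a sketch under the usual assumption, not stated explicitly above, that \(E(\epsilon_i \mid education_i, gender_i) = 0\); the symbol \(e\) is introduced here for a fixed number of years of education.

\[\begin{equation} H_0: \beta_3 = 0 \qquad \text{versus} \qquad H_1: \beta_3 \neq 0 \end{equation}\]

The same expressions also give the expected earnings gap between females and males at a given level of education:

\[\begin{equation} E(earnings_i \mid gender_i = 1, education_i = e) - E(earnings_i \mid gender_i = 0, education_i = e) = \beta_2 + \beta_3 e \end{equation}\]

So \(\beta_2\) alone is the gap only at \(e = 0\), and the gap changes by \(\beta_3\) with each additional year of education.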

Let’s run the specification that tests our alternative hypothesis.

# female * educ expands to educ + female + educ:female; repeated terms are fit once
reg <- lm(income ~ educ + female + female * educ, data = dat)
summary(reg)
## 
## Call:
## lm(formula = income ~ educ + female + female * educ, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8806.8 -1793.6     0.5  1760.2  9360.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20066.81     158.72  126.43   <2e-16 ***
## educ         3984.76      16.77  237.61   <2e-16 ***
## female      -2647.20     219.10  -12.08   <2e-16 ***
## educ:female   512.65      23.35   21.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2561 on 1996 degrees of freedom
## Multiple R-squared:  0.9852, Adjusted R-squared:  0.9852 
## F-statistic: 4.436e+04 on 3 and 1996 DF,  p-value: < 2.2e-16
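The test of the null hypothesis rests on the `educ:female` row of this table. As a minimal sketch, that row and a 95% confidence interval can be pulled out directly with base R's `coef()`, `summary()`, and `confint()`; the names `interaction_row`, `interaction_ci`, and `reject_null` are illustrative, and the 5% significance level is an assumption rather than something the text fixes.

```r
# Re-create the simulated data and model so the block runs on its own
set.seed(1254)
n <- 2000
female <- rbinom(n, 1, 0.5)
educ <- sample(seq(0, 16, 1), n, replace = TRUE)
income <- 20000 + (4000 * educ) - (2500 * female) + (500 * female * educ) +
  rnorm(n, mean = 0, sd = 2500)
dat <- data.frame(educ = educ, income = income, female = female)
reg <- lm(income ~ educ + female + female * educ, data = dat)

# Estimate, standard error, t value, and p-value for the interaction term
interaction_row <- coef(summary(reg))["educ:female", ]
print(interaction_row)

# 95% confidence interval for the interaction coefficient
interaction_ci <- confint(reg, "educ:female", level = 0.95)
print(interaction_ci)

# Reject the null (no difference in returns) if the p-value is below 0.05
reject_null <- unname(interaction_row["Pr(>|t|)"] < 0.05)
print(reject_null)
```

If the confidence interval excludes zero, the data are inconsistent with equal returns to education for males and females at that level.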

For males, an additional year of education raises earnings by about 3,985. At zero years of education, females earn about 2,647 less than males on average. Each additional year of education raises earnings by about 513 more for females than for males. This interaction is statistically significant (p < 2e-16), so we reject the null hypothesis that returns to education do not vary by gender. Note that these estimates are close to the true parameters used in the simulated DGP (20,000, 4,000, -2,500, and 500).
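The comparison with the true parameters can be laid out side by side. The sketch below re-creates the data, tabulates the estimates against the values used in the DGP, and computes the implied education slope for each group; `true_params`, `comparison`, `slope_male`, and `slope_female` are illustrative names.

```r
# Re-create the simulated data and model so the block runs on its own
set.seed(1254)
n <- 2000
female <- rbinom(n, 1, 0.5)
educ <- sample(seq(0, 16, 1), n, replace = TRUE)
income <- 20000 + (4000 * educ) - (2500 * female) + (500 * female * educ) +
  rnorm(n, mean = 0, sd = 2500)
dat <- data.frame(educ = educ, income = income, female = female)
reg <- lm(income ~ educ + female + female * educ, data = dat)

# True parameters, taken from the line that generates income
true_params <- c("(Intercept)" = 20000, educ = 4000,
                 female = -2500, "educ:female" = 500)

# Line up the estimates with the true values, matched by coefficient name
est <- coef(reg)
comparison <- data.frame(true = true_params,
                         estimate = est[names(true_params)])
comparison$difference <- comparison$estimate - comparison$true
print(round(comparison, 2))

# Implied return to one more year of education in each group
slope_male <- unname(est["educ"])
slope_female <- unname(est["educ"] + est["educ:female"])
print(c(male = slope_male, female = slope_female))
```

The female slope is the sum of the `educ` and `educ:female` coefficients, which is why the interaction term alone carries the test of differential returns.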