4.1 The best-fit line

Let's start with an illustration. Say we want to understand the relationship between education and income. We first simulate the data and then plot the relationship.

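library(dplyr)      # %>%, group_by(), summarize()
library(ggplot2)    # plotting
library(patchwork)  # stacks the two panels with f0 / f1

set.seed(123)  # arbitrary seed so the simulation is reproducible

n  <- 1000
educ  <- sample(seq(1, 16, 1), n, replace = TRUE)
# the noise term has mean 10000, so the implied intercept is 20000 + 10000 = 30000
income  <- 20000 + 2000 * educ + rnorm(n, mean = 10000, sd = 5000)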

dat  <- data.frame(educ = educ, income = income)

# mean income for each value of education
dat_sum  <- dat  %>% 
                group_by(educ)  %>% 
                summarize(mean_income = mean(income))

# merge the conditional means back onto the individual-level data
dat  <- dat  %>% 
            merge(dat_sum, by = "educ", all.x = TRUE)

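# Panel A: raw points plus a line through the mean income at each education level
# (linewidth replaces size for lines in ggplot2 3.4.0 and later)
f0  <- ggplot(dat, aes(x = educ, y = income)) + geom_point() + 
            geom_line(aes(x = educ, y = mean_income), linewidth = 1) +
        xlab("years of education") + ggtitle("Panel A") + 
        annotate("text", x = 10, y = 35000, 
                label = "Line plotting the conditional mean", color = "blue", hjust = 0)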

# Panel B: raw points plus the least-squares (best-fit) line
f1  <- ggplot(dat, aes(x = educ, y = income)) + geom_point() + 
            geom_smooth(method = "lm", se = FALSE, color  = "blue") + 
        xlab("years of education") + ggtitle("Panel B") + 
        annotate("text", x = 10, y = 35000, 
                label = "Best-Fit Line; E(Y|X)", color = "blue", hjust = 0)

f0 / f1
## `geom_smooth()` using formula = 'y ~ x'

Each point in the figure represents one individual. Panel A overlays a line connecting the mean income at each level of education (the conditional mean), while Panel B overlays the best-fit line, also known as the regression line.
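
To make the best-fit line concrete, here is a minimal sketch that fits the same line numerically. `geom_smooth(method = "lm")` fits `y ~ x` by least squares, so `lm(income ~ educ)` should give the intercept and slope of the line in Panel B. The seed and the `new_dat` data frame are assumptions added for illustration; the simulation is repeated so the block runs on its own.

```r
# Minimal sketch: recover the best-fit line with lm() on freshly simulated data.
set.seed(123)  # arbitrary seed, an assumption made for reproducibility

n <- 1000
educ <- sample(seq(1, 16, 1), n, replace = TRUE)
income <- 20000 + 2000 * educ + rnorm(n, mean = 10000, sd = 5000)
dat <- data.frame(educ = educ, income = income)

# Least-squares fit of income on years of education
fit <- lm(income ~ educ, data = dat)

# Intercept and slope of the best-fit line
coef(fit)

# Height of the line at a few education levels (new_dat is hypothetical)
new_dat <- data.frame(educ = c(8, 12, 16))
predict(fit, newdata = new_dat)
```

Because the noise term has mean 10000, the estimated intercept should land near 30000 rather than 20000, and the slope near 2000. The exact values depend on the random draw.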