4.1 The best-fit line
Let us start with an illustration. Suppose we want to understand the relationship between education and income. We first simulate the data and then plot the relationship.
library(dplyr)      # %>%, group_by(), summarize()
library(ggplot2)    # plotting
library(patchwork)  # provides `/` for stacking two ggplot objects

n <- 1000
educ <- sample(seq(1, 16, 1), n, replace = TRUE)
income <- 20000 + 2000 * educ + rnorm(n, mean = 10000, sd = 5000)
dat <- data.frame(educ = educ, income = income)

# mean income for each value of education
dat_sum <- dat %>%
  group_by(educ) %>%
  summarize(mean_income = mean(income))

# merge the conditional means back onto the individual-level data
dat <- dat %>%
  merge(dat_sum, by = "educ", all.x = TRUE)

# Panel A: scatter plot with a line through the conditional means
f0 <- ggplot(dat, aes(x = educ, y = income)) + geom_point() +
  geom_line(aes(x = educ, y = mean_income), linewidth = 1) +
  xlab("years of education") + ggtitle("Panel A") +
  annotate("text", x = 10, y = 35000,
           label = "Line plotting the conditional mean", color = "blue", hjust = 0)

# Panel B: scatter plot with the least-squares (best-fit) line
f1 <- ggplot(dat, aes(x = educ, y = income)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") + xlab("years of education") +
  annotate("text", x = 10, y = 35000, label = "Best-Fit Line; E(Y|X)", color = "blue", hjust = 0) +
  ggtitle("Panel B")

# stack Panel A on top of Panel B
f0 / f1
## `geom_smooth()` using formula = 'y ~ x'
Each point in the figure represents an individual. Panel A connects the mean income at each level of education (the conditional mean), while Panel B adds the best-fit line, which is also known as the regression line.
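To see how the two panels relate, the short sketch below fits the same line with `lm()` and compares its fitted values with the conditional means at each level of education. It is only an illustration under a few assumptions: the data are simulated again with the same design as above, the seed (`123`) and the object names `fit` and `cond_mean` are introduced here and are not part of the original example, and base R's `aggregate()` stands in for the `group_by()` and `summarize()` step.

```r
# Simulate data with the same design as above.
# The seed is an assumption, added only to make the run reproducible.
set.seed(123)
n <- 1000
educ <- sample(1:16, n, replace = TRUE)
income <- 20000 + 2000 * educ + rnorm(n, mean = 10000, sd = 5000)
dat <- data.frame(educ = educ, income = income)

# Fit the best-fit (least-squares) line: income = a + b * educ
fit <- lm(income ~ educ, data = dat)
coef(fit)  # intercept and slope of the line

# Conditional mean of income at each level of education (Panel A)
cond_mean <- aggregate(income ~ educ, data = dat, FUN = mean)

# Value of the best-fit line at the same education levels (Panel B)
cond_mean$fitted <- predict(fit, newdata = cond_mean)
cond_mean
```

Because the noise term has mean 10000, the simulated intercept is effectively 30000 and the slope is 2000, so the estimated coefficients should land near those values. The `income` and `fitted` columns of `cond_mean` should also be close to each other, since the conditional mean is linear in education by construction.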