4.2 Linear Regression Specification

Let’s write down the relationship between years of education and earnings using a simple (univariate) linear regression model.

\[\begin{equation} Y_{i} = \alpha + \beta X_{i} + \epsilon_{i} \tag{4.1} \end{equation}\]

where,

  • \(Y_{i}\) is income for an individual \(i\),

  • \(X_i\) is years of education,

  • \(\alpha\) is the y-intercept,

  • \(\beta\) is the slope,

  • \(\epsilon_i\) is the error term.

We’ll observe \(Y\) and \(X\). \(\epsilon_{i}\), the error term, is an unobserved random variable. The error term can be written as: \(\epsilon_{i} = Y_i - \alpha - \beta X_{i}\).

Note that the regression specification is a population concept. For example, if you could have everyone in the population in your data set, then \(\beta\) is the population coefficient. The error is again a population concept. However, (almost always) you don’t observe the population; you have to work with the sample. Using a sample, you need to estimate \(\alpha\) and \(\beta\).
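To make the sample-versus-population distinction concrete, here is a minimal sketch in Python. All numbers are hypothetical: we pretend the true population relationship is \(Y = 5 + 2X + \epsilon\), draw a sample, and recover estimates of \(\alpha\) and \(\beta\) with the closed-form OLS formulas.

```python
import numpy as np

# Hypothetical population: Y = 5 + 2*X + eps (alpha = 5, beta = 2).
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(8, 20, n)            # years of education
eps = rng.normal(0, 3, n)            # unobserved error term
y = 5.0 + 2.0 * x + eps              # income (arbitrary units)

# Closed-form OLS estimates for the simple regression:
# beta_hat = Cov(X, Y) / Var(X), alpha_hat = mean(Y) - beta_hat * mean(X)
beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)
alpha_hat = y.mean() - beta_hat * x.mean()

print(alpha_hat, beta_hat)           # close to 5 and 2, but not exactly
```

Because we only see a sample, the estimates differ from the population values; a different draw would give slightly different numbers.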

For the regression to make sense, we’ve got to make some assumptions about the error term. Note that you don’t observe the error term. Anything that explains \(Y\) but is not specified in the regression is captured by the error term. For example, the earnings specification leaves out experience, even though earnings increase with experience. Since it is not specified, this variable is absorbed by the error term. Suppose instead you could build an all-seeing model that includes every variable that belongs in it, with the functional form correctly specified. In that case the error term drops out and the model becomes deterministic, similar to \(y = mx + c\).
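A small hypothetical illustration of the error term absorbing an omitted variable: experience affects income but is left out of the fitted regression, so its effect shows up in the residuals. (The variable names and numbers are made up for the sketch.)

```python
import numpy as np

# Hypothetical data: income depends on education AND experience,
# but we fit a regression on education only.
rng = np.random.default_rng(2)
n = 1000
educ = rng.uniform(8, 20, n)             # included regressor
exper = rng.uniform(0, 30, n)            # omitted variable
y = 3.0 + 2.0 * educ + 0.5 * exper + rng.normal(0, 1, n)

# Simple regression of y on educ only (closed-form OLS).
beta_hat = np.cov(educ, y, bias=True)[0, 1] / np.var(educ)
alpha_hat = y.mean() - beta_hat * educ.mean()
resid = y - alpha_hat - beta_hat * educ

# The residuals still move with the omitted experience variable:
# its effect has been pushed into the error term.
print(np.corrcoef(resid, exper)[0, 1])   # clearly nonzero
```

Here experience is drawn independently of education, so the slope on education is still estimated well; the point is only that the omitted variable lives on in the residual.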

You can move from the simple regression to multiple regression by linearly adding variables.

\[\begin{equation} Y_{i} = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_{i} \tag{4.2} \end{equation}\]

where \(X_2\) could be experience. For simplicity, we’ll mainly focus on simple linear regression.
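The multiple regression can be estimated the same way; a minimal sketch (again with made-up numbers, taking \(X_2\) to be experience) stacks a column of ones for \(\alpha\) alongside the regressors and solves the least-squares problem:

```python
import numpy as np

# Hypothetical population: Y = 3 + 2*X1 + 0.5*X2 + eps,
# where X1 is education and X2 is experience.
rng = np.random.default_rng(1)
n = 1000
x1 = rng.uniform(8, 20, n)           # years of education
x2 = rng.uniform(0, 30, n)           # years of experience
eps = rng.normal(0, 3, n)
y = 3.0 + 2.0 * x1 + 0.5 * x2 + eps

# Design matrix: a column of ones (for alpha) plus the regressors.
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)                          # roughly [3, 2, 0.5]
```

The first element of `coef` estimates \(\alpha\), the next two estimate \(\beta_1\) and \(\beta_2\).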