7.3 Motivation for Causal Forests
Let’s expand on estimating the average treatment effect of a treatment intervention \(W\). The setup is as follows:
- \(W_i \in \{0, \; 1\}\): treatment indicator
- \(X_i\): covariates
- \(Y_i\): response/outcome
In the parametric framework, the treatment effect \(\tau\) is estimated using the following specification:
\(Y_i = \tau W_i + \beta_1 X_i + \epsilon_i\)
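As a purely illustrative sketch of this parametric estimate, one can simulate a small dataset and run OLS; the variable names and data-generating process below are made up for illustration:

# illustrative only: simulate data where the linear specification is correct
set.seed(1)
n <- 1000
X <- rnorm(n)                      # a single covariate
W <- rbinom(n, 1, plogis(X))       # treatment assignment depends on X
Y <- 2 * W + 0.5 * X + rnorm(n)    # true tau = 2
coef(lm(Y ~ W + X))["W"]           # parametric estimate of tau

Because the simulated outcome really is linear in \(X\) and the effect is the same for everyone, the OLS coefficient on \(W\) recovers \(\tau\); the assumptions below describe when this is no longer enough.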
The validity of \(\hat{\tau}\) as an estimate of the causal effect rests on the following three assumptions.
- Unconfoundedness: \(\{Y^{(0)}_i, \; Y^{(1)}_i\} \perp W_i \mid X_i\). Treatment assignment is independent of the potential outcomes once we condition on the covariates. In other words, controlling for covariates makes the treatment assignment as good as random.
- Linearity: the \(X_i\)s influence \(Y_i\) in a linear way.
- Homogeneity: the treatment effect is the same for all units.
Assumption 1 is the identification assumption. Traditionally, one controls for the \(X_i\)s in a regression framework and argues that this assumption is met. However, even if all \(X\)s that influence treatment assignment are observed (an assumption we maintain throughout), we do not know how the \(X\)s affect the treatment; often they do so in a non-linear way. Assumptions 2 and 3 can be questioned and relaxed: one can let the data determine how \(X\) should enter the model specification (relaxing assumption 2), and treatment effects can be allowed to vary across covariates (relaxing assumption 3).
First, let’s relax assumption 2. This leads to the following partially linear model:
\(Y_i = \tau W_i + f(X_i) + \epsilon_i \;\;\;\;\; \text{(equation 1)}\)
where \(f\) is a function that maps out how \(X\) affects \(Y\). However, we don’t know \(f\) in practice. So, how do we go about estimating \(\tau\)?
The causal forest framework under GRF connects the old-school causal inference literature with ML methods. Robinson (1988) shows that if two intermediate (nuisance) objects, \(e(X_i)\) and \(m(X_i)\), are known, one can estimate \(\tau\). The causal forest framework under GRF utilizes this result. Here:
- \(e(X_i)\) is the propensity score, i.e., the probability of being treated: \(e(x) = E[W_i \mid X_i = x]\)
- \(m(X_i)\) is the conditional mean of \(Y\): \(m(x) = E[Y_i \mid X_i = x] = f(x) + \tau e(x)\)
Demeaning equation 1 (subtracting \(m(x)\)) gives the following residual-on-residual regression:
\(Y_i - m(x) = \tau (W_i - e(x)) + \epsilon_i \;\;\;\;\; \text{(equation 2)}\)
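To spell out the step (assuming \(E[\epsilon_i \mid X_i] = 0\)): taking conditional expectations of equation 1 given \(X_i = x\) gives

\(m(x) = E[Y_i \mid X_i = x] = \tau\, E[W_i \mid X_i = x] + f(x) = \tau e(x) + f(x),\)

and subtracting this from equation 1 eliminates the unknown \(f(x)\):

\(Y_i - m(x) = \tau W_i + f(x) + \epsilon_i - \tau e(x) - f(x) = \tau\,(W_i - e(x)) + \epsilon_i.\)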
Intuition for equation (2) proceeds as follows. Note that \(m(x)\) is the conditional mean of \(Y\) given \(X_i = x\) (see the footnote at the end of this section). This means that units with similar \(X\)s will have similar estimates of \(m(x)\) whether \(W = 0\) or \(W = 1\), and their estimates of \(e(x)\) will likewise be similar across the treatment and control groups. Now suppose the treatment effect is positive; this will show up in \(Y_i\): \(Y_i - m(x)\) will be higher for \(W = 1\) than for \(W = 0\) among units with similar estimates of \(m(x)\). On the other side, \(W_i - e(x)\) is positive for \(W = 1\) and negative for \(W = 0\) among units with similar estimates of \(e(x)\). Such co-movement between the left- and right-hand-side quantities is what allows us to capture a positive estimate of \(\tau\).
In practice, ML methods are used to estimate \(m(x)\) and \(e(x)\), and the residual-on-residual regression is then used to estimate \(\tau\). It turns out that even noisy estimates of \(e(x)\) and \(m(x)\) can give an “ok” estimate of \(\tau\).
How to estimate \(m(x)\) and \(e(x)\)?
- Use ML methods (boosting, random forests).
- Use cross-fitting for the predictions: the prediction of observation \(i\)’s outcome and treatment assignment is obtained without using observation \(i\) itself (a minimal sketch follows this list).
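To make the cross-fitting idea concrete, here is a minimal sketch with two folds and simple linear/logistic learners. The data-generating process matches the demo below, but the fold construction and learners are illustrative choices, not the grf implementation:

# illustrative two-fold cross-fitting with simple learners (not the grf machinery)
set.seed(1)
n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.4 + 0.2 * (X[, 1] > 0))
Y <- 2.5 * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)
dat <- data.frame(Y = Y, W = W, X)

fold <- sample(rep(1:2, length.out = n))   # assign each observation to a fold
m.hat <- e.hat <- numeric(n)
for (k in 1:2) {
  test  <- fold == k
  train <- !test
  # fit the nuisance models on the other fold only
  m.fit <- lm(Y ~ . - W, data = dat[train, ])
  e.fit <- glm(W ~ . - Y, data = dat[train, ], family = binomial)
  # predict for the held-out fold, so observation i never uses itself
  m.hat[test] <- predict(m.fit, newdata = dat[test, ])
  e.hat[test] <- predict(e.fit, newdata = dat[test, ], type = "response")
}

# residual-on-residual regression with the cross-fitted nuisance estimates
coef(lm((Y - m.hat) ~ I(W - e.hat)))

The coefficient on the centered treatment should be close to the true effect of 2.5. In the grf demo below, explicit folds are not needed because the forests’ predictions for the training sample are out-of-bag, which plays the same role.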
Let’s take a look at residual-on-residual regression in the case of a homogeneous treatment effect.
# generate data
library(grf)  # for regression_forest()
n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
X.test <- matrix(0, 101, p)
X.test[, 1] <- seq(-2, 2, length.out = 101)

# Generate W and Y
W <- rbinom(n, 1, 0.4 + 0.2 * (X[, 1] > 0))
prob <- 0.4 + 0.2 * (X[, 1] > 0)
Y <- 2.5 * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)

# Train regression forests for the two nuisance objects:
# mx estimates m(x) = E[Y | X = x], ex estimates e(x) = E[W | X = x]
mx <- regression_forest(X, Y, tune.parameters = "all")
ex <- regression_forest(X, W, tune.parameters = "all")

# Center W and Y using the forests' out-of-bag predictions
Wcen <- W - ex$predictions
Ycen <- Y - mx$predictions

# Residual-on-residual regression
reg <- summary(lm(Ycen ~ Wcen))
reg
##
## Call:
## lm(formula = Ycen ~ Wcen)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3.6687  -0.7147  -0.0139   0.7059   3.9119
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.00539    0.02359  -0.228    0.819
## Wcen         2.45817    0.04791  51.305   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 1998 degrees of freedom
## Multiple R-squared: 0.5685, Adjusted R-squared: 0.5683
## F-statistic: 2632 on 1 and 1998 DF, p-value: < 2.2e-16
print(paste0("The treatment effect estimate based on residual-on-residual regression is: ", coefficients(reg)[2]))
## [1] "The treatment effect estimate based on residual-on-residual regression is: 2.45817277748273"
print(paste0("The true treatment effect is: ", 2.5))
## [1] "The true treatment effect is: 2.5"
We can also think of \(m(x)\) as what we estimate when we ignore \(W\), even though we know that treatment took place. In that case, \(m(x) = \mu_{0}(x) + e(x)\tau\), where \(\mu_{0}(x)\) is the baseline conditional expectation without the treatment. This makes it easy to see that units with similar features will have similar estimates of \(m(x)\).