7.4 Causal Forest

Both regression and causal forests consist of two phases: 1) a building phase, in which the trees are grown; and 2) an estimation phase, in which predictions are formed at test points.

The intuition behind the regression/causal forest can be gleaned from the following figure.

Figure 7.1: Adaptive weights

In this simple case, the sample is partitioned into \(N_1\) and \(N_2\) neighborhoods according to the splitting rule that maximizes the squared difference in sub-sample-specific treatment effects, i.e., the split for which \(n_{N_1}n_{N_2}(\tau_{N_1} - \tau_{N_2})^2\) is the maximum. By construction, this leads to an (approximately) constant treatment effect within each neighborhood, while the effects may vary across neighborhoods. This intuition allows us to relax assumption 3 and re-write the partially linear estimation framework as: \(Y_i = \tau(X_i) W_i + f(X_i) + \epsilon_i\).

Here the estimate of the treatment effect \(\tau\) is allowed to vary with the test point \(x\).
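To make the splitting rule concrete, here is a minimal R sketch (function and variable names are hypothetical) that evaluates the criterion \(n_{N_1} n_{N_2}(\tau_{N_1} - \tau_{N_2})^2\) for one candidate split, assuming a binary, randomized treatment so that a simple difference in means estimates each sub-sample effect:

```r
# Hypothetical helper: value of the splitting criterion for one candidate split.
# Y: outcomes; W: binary randomized treatment; in_N1: logical vector marking
# which observations fall in neighborhood N1 (the rest fall in N2).
split_criterion <- function(Y, W, in_N1) {
  tau_hat <- function(y, w) mean(y[w == 1]) - mean(y[w == 0])
  tau_1 <- tau_hat(Y[in_N1], W[in_N1])    # tau_{N1}
  tau_2 <- tau_hat(Y[!in_N1], W[!in_N1])  # tau_{N2}
  sum(in_N1) * sum(!in_N1) * (tau_1 - tau_2)^2
}
```

The tree greedily picks, among all candidate splits, the one with the largest value of this criterion.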

In reference to Figure 7.1 above, \(N_1\) and \(N_2\) are neighborhoods where treatment effects are constant. To estimate the treatment effect at the test point \(x\), \(\tau(x)\), we would run a weighted residual-on-residual regression of the form:

\(\tau(x) := lm\left(Y_i - m^{-i}(X_i) \sim W_i - e^{-i}(X_i), \; weights = 1\{X_i \in N(x)\}\right)\)

where \(m^{-i}(X_i)\) and \(e^{-i}(X_i)\) are obtained from cross-fitting. The weights play a pivotal role here: the weight takes a value of 1 if \(X_i\) belongs to the same neighborhood as \(x\), and 0 otherwise. In the figure above, examples in \(N_2\) receive non-zero weight while those in \(N_1\) receive zero weight. However, this example pertains only to a single tree; we want to build a forest and apply the same logic.
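Written out in R, the single-tree regression above might look as follows; this is a sketch assuming the cross-fitted predictions and the neighborhood indicator have already been computed (all object names are hypothetical):

```r
# Sketch of the single-tree weighted residual-on-residual regression.
# m.hat, e.hat: cross-fitted predictions m^{-i}(X_i) and e^{-i}(X_i);
# in_Nx: logical, TRUE if X_i falls in the same neighborhood N(x) as x.
# No intercept, since both sides of the regression are residuals.
fit <- lm(I(Y - m.hat) ~ 0 + I(W - e.hat), weights = as.numeric(in_Nx))
tau.x <- unname(coef(fit))  # estimate of tau(x)
```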

Adaptive weights. The forest consists of \(B\) trees, so the weight for each \(X_i\) with respect to the test point \(x\) is based on all \(B\) trees. The causal forest utilizes the adaptive weights of random forests.

The tree-specific weight for an example \(i\) in the \(b^{th}\) tree is given as: \(\alpha_{ib}(x) = \frac{1(X_i \in L_{b}(x))}{|L_{b}(x)|}\), where \(L_b(x)\) is the leaf (neighborhood) of tree \(b\) that contains the test point \(x\).

The forest-specific weight for an example \(i\) is given as: \(\alpha_{i}(x) = \frac{1}{B} \sum_{b = 1}^{B} \frac{1(X_i \in L_b(x))}{|L_b(x)|} = \frac{1}{B} \sum_{b = 1}^{B} \alpha_{ib}(x)\)

It tracks the fraction of times an observation \(i\) falls in the same leaf as \(x\) across the forest. Simply put, it shows how similar \(i\) is to \(x\).
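As a concrete illustration, the forest weights can be computed directly from leaf memberships, as in this R sketch (the leaf-membership inputs are assumed to come from an already-grown forest; all names are hypothetical):

```r
# Sketch: compute the forest weights alpha_i(x) from leaf memberships.
# leaf_id[i, b]: id of the leaf of tree b that contains observation i;
# leaf_x[b]: id of the leaf of tree b that contains the test point x.
forest_weights <- function(leaf_id, leaf_x) {
  B <- ncol(leaf_id)
  alpha <- numeric(nrow(leaf_id))
  for (b in 1:B) {
    in_leaf <- leaf_id[, b] == leaf_x[b]     # 1(X_i in L_b(x))
    alpha <- alpha + in_leaf / sum(in_leaf)  # alpha_{ib}(x)
  }
  alpha / B  # average the tree-specific weights over the B trees
}
```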

Regression Forest. It utilizes the adaptive weights given to each example \(i\) (\(i = 1, 2, \ldots, N\)) and constructs a weighted average to form the prediction at \(x\). The prediction at \(x\) based on the regression forest is:

\(\hat{\mu}(x) = \frac{1}{B}\sum_{i = 1}^{N} \sum_{b=1}^{B} Y_{i} \frac{1(X_i \in L_{b}(x))}{|L_b(x)|}\)

\(= \sum_{i = 1}^{N} Y_{i} \alpha_{i}(x)\)

Note that this is different from the traditional prediction of a random forest, which averages the predictions from the individual trees:

\(\hat{\mu}_{trad}(x) = \frac{1}{B}\sum_{b = 1}^{B} \hat{Y}_b(x)\)
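Continuing the sketch above, the two predictions differ by one line of R each (`alpha` holds the forest weights \(\alpha_i(x)\), and `Y.hat.tree` is a hypothetical vector of per-tree predictions at \(x\)):

```r
# Regression-forest prediction: a single weighted average of the outcomes.
mu.hat <- sum(Y * alpha)         # sum_i Y_i * alpha_i(x)

# Traditional random-forest prediction: average the per-tree predictions.
mu.hat.trad <- mean(Y.hat.tree)  # (1/B) * sum_b Y.hat_b(x)
```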

Causal Forest. The causal forest is analogous to the regression forest in the sense that the target is \(\tau(x)\) rather than \(\mu(x)\). Conceptually, the difference is encoded in the splitting criterion. While splitting, the regression forest uses the criterion \(\max \; n_{N_1} n_{N_2}(\mu_{N_1} - \mu_{N_2})^2\), whereas the causal forest uses \(\max \; n_{N_1} n_{N_2}(\tau_{N_1} - \tau_{N_2})^2\).

In a world with infinite computing power, for each potential axis-aligned split extending from the parent node, one would estimate treatment effects in the two child nodes (\(\tau_{L}\) and \(\tau_{R}\)) and choose the split that maximizes the squared difference between the child-specific treatment effects. In practice, however, this is computationally demanding and often infeasible. Implementations of the causal forest instead estimate \(\tau_{P}\) at the parent node and use a gradient-based approximation to guide the split; at each (parent) node the treatment effect is estimated only once.
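A rough sketch of this shortcut, following the pseudo-outcome construction of generalized random forests (Athey, Tibshirani & Wager, 2019): it assumes \(Y\) and \(W\) within the parent node have already been residualized against \(m^{-i}\) and \(e^{-i}\), and the function name is hypothetical.

```r
# Sketch of the gradient-based split shortcut. tau_P is estimated once in
# the parent node; the split then treats the pseudo-outcomes rho_i as an
# ordinary regression target instead of re-estimating tau in each child.
pseudo_outcomes <- function(Y, W) {
  Wc <- W - mean(W)
  Yc <- Y - mean(Y)
  A_P <- mean(Wc^2)              # curvature of the estimating equation
  tau_P <- mean(Wc * Yc) / A_P   # parent-node treatment effect
  Wc * (Yc - Wc * tau_P) / A_P   # rho_i: i's influence on tau_P
}
```

Splits are then chosen to maximize the heterogeneity of the mean pseudo-outcome across the two children, which approximates the infeasible criterion above at a fraction of the computational cost.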

Once the vector of weights is determined for each \(i\), the following weighted residual-on-residual regression is run:

\(\tau(x) := lm\left(Y_i - m^{-i}(X_i) \sim W_i - e^{-i}(X_i), \; weights = \alpha_i(x)\right)\)
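In the notation of the earlier single-tree sketch, only the weights change (this reuses the hypothetical objects defined in the sketches above):

```r
# Forest version of the same regression: replace the 0/1 neighborhood
# indicator with the adaptive forest weights alpha_i(x).
alpha <- forest_weights(leaf_id, leaf_x)  # from the sketch above
fit <- lm(I(Y - m.hat) ~ 0 + I(W - e.hat), weights = alpha)
tau.x <- unname(coef(fit))                # estimate of tau(x)
```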

This can be broken down as:

  1. Estimate \(m^{-i}(X_i)\) and \(e^{-i}(X_i)\) using random forests.
  2. Then estimate \(\alpha_i(x)\). For each new test point \(x\), a vector of weights is determined based on the adaptive weighting scheme of the random forest. Note that the weights change with each new test point.
  3. Run a weighted residual-on-residual regression given by the equation above.
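For reference, all three steps are carried out internally by the grf package in R; the following minimal sketch uses simulated data, and all parameter choices are illustrative:

```r
# Minimal end-to-end sketch with the grf package: causal_forest() performs
# cross-fitting of m(.) and e(.), builds the forest, and runs the weighted
# residual-on-residual regression at each test point internally.
library(grf)

set.seed(1)
n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)            # randomized binary treatment
tau <- pmax(X[, 1], 0)            # true heterogeneous effect tau(x)
Y <- tau * W + X[, 2] + rnorm(n)  # outcome

cf <- causal_forest(X, Y, W)      # steps 1-3 happen inside

X.test <- matrix(0, 11, p)
X.test[, 1] <- seq(-2, 2, length.out = 11)
tau.hat <- predict(cf, X.test)$predictions  # tau(x) at each test point
```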