5.9 AIPW and Estimation

Augmented Inverse Probability Weighting (AIPW) provides a robust way to estimate the ATE that alleviates the limitations of the IPW estimator.

Following the IPW approach, the estimate of the ATE is given in equation (6). An alternative is to estimate \(\tau\) through the conditional response functions. Write \(\mu_{w}(x) = E[Y_i \mid X_i = x, W_i = w]\). Then:

\(\tau(x) = E[Y_i \mid X_i = x, W_i = 1] - E[Y_i \mid X_i = x, W_i = 0] = \mu_{1}(x) - \mu_{0}(x)\)

This is the outcome-regression approach, where \(\tau = E[\mu_{1}(X_i) - \mu_{0}(X_i)]\). A consistent estimator can be formed as \(\hat{\tau} = N^{-1} \sum_{i = 1}^{N} \left(\hat{\mu}_{1}(X_i) - \hat{\mu}_{0}(X_i)\right)\), where \(\hat{\mu}_{w}\) is fitted by regression on the corresponding subsample.
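A minimal sketch of the outcome-regression estimator on simulated data (the data-generating process below is hypothetical, with a true ATE of 2, and `lm` stands in for any regression learner):

```r
# Outcome-regression estimate of the ATE on simulated confounded data
set.seed(1)
n <- 5000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(x))   # treatment probability depends on x (confounding)
y <- x + 2 * w + rnorm(n)      # true ATE = 2

# Fit mu_1 and mu_0 on the treated / control subsamples
mu1.fit <- lm(y ~ x, subset = (w == 1))
mu0.fit <- lm(y ~ x, subset = (w == 0))

# Average the difference in predicted responses over the full sample
mu1.hat <- predict(mu1.fit, newdata = data.frame(x = x))
mu0.hat <- predict(mu0.fit, newdata = data.frame(x = x))
tau.reg <- mean(mu1.hat - mu0.hat)   # close to the true ATE of 2
```

Note that the naive difference in means `mean(y[w == 1]) - mean(y[w == 0])` would be biased upward here, because treated units have larger \(x\) on average.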

The AIPW approach combines the IPW and outcome-regression approaches to estimate \(\tau\):

\(\hat{\tau}_{AIPW} = \frac{1}{N} \sum_{i = 1}^{N} \left( \hat{\mu}_{1}(X_i) - \hat{\mu}_{0}(X_i) + \frac{W_i \, (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-W_i) \, (Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right)\)
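Given fitted nuisance estimates, the AIPW estimate is a one-line average of per-observation scores. A sketch (the arguments `mu1.hat`, `mu0.hat`, `e.hat` are placeholders for predictions from any learner):

```r
# AIPW score for each observation; its mean is tau_hat_AIPW
aipw.score <- function(y, w, mu1.hat, mu0.hat, e.hat) {
  mu1.hat - mu0.hat +
    w * (y - mu1.hat) / e.hat -
    (1 - w) * (y - mu0.hat) / (1 - e.hat)
}

# tau.hat <- mean(aipw.score(y, w, mu1.hat, mu0.hat, e.hat))
# se.hat  <- sd(aipw.score(y, w, mu1.hat, mu0.hat, e.hat)) / sqrt(length(y))
```

Working with per-observation scores is convenient because the standard error of \(\hat{\tau}_{AIPW}\) is just the standard deviation of the scores divided by \(\sqrt{N}\).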

An ML approach with cross-fitting is used to estimate both \(\hat{e}(x)\) and \(\hat{\mu}_{w}(x)\): the model used to predict for observation \(i\) is trained on the folds excluding \(i\)'s own fold \(k(i)\). Following the cross-fitting structure, we can formally write the estimate for \(\tau\) as:

\(\hat{\tau}_{AIPW} = \frac{1}{N} \sum_{i = 1}^{N} \left( \underbrace{\hat{\mu}_{1}^{-k(i)}(X_i) - \hat{\mu}_{0}^{-k(i)}(X_i)}_{\text{consistent estimate of } \tau} + \frac{W_i \, (Y_i - \hat{\mu}_1^{-k(i)}(X_i))}{\hat{e}^{-k(i)}(X_i)} - \frac{(1-W_i) \, (Y_i - \hat{\mu}_0^{-k(i)}(X_i))}{1 - \hat{e}^{-k(i)}(X_i)} \right)\)

The AIPW estimator can be read as the outcome-regression estimate of the ATE, i.e. the difference across conditional responses, with the residuals then re-weighted by the inverse propensity score. There are two attractive features of the AIPW estimate. First, \(\hat{\tau}_{AIPW}\) is consistent as long as either \(\hat{e}(x)\) or \(\hat{\mu}_{w}(x)\) is consistent (double robustness): when \(\hat{\mu}\) is consistent, \(E[Y_i - \hat{\mu}_{W_i}(X_i)] \approx 0\) and the correction terms vanish on average; otherwise a consistent \(\hat{e}\) makes the IPW correction restore consistency. Second, \(\hat{\tau}_{AIPW}\) is a good approximation to the oracle \(\hat{\tau}_{AIPW}^{*}\) as long as \(\hat{\mu}(\cdot)\) and \(\hat{e}(\cdot)\) are reasonably accurate; if one estimate is highly accurate, it can compensate for lack of accuracy in the other. If both \(\hat{\mu}(\cdot)\) and \(\hat{e}(\cdot)\) are \(\sqrt{n}\)-consistent\(^{1}\), then the following holds.

\(\sqrt{n}(\hat{\tau}_{AIPW} - \hat{\tau}_{AIPW}^{*}) \rightarrow_p 0\).
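The double-robustness property can be checked numerically: deliberately misspecify one nuisance function and the AIPW estimate should still center on the truth. A sketch on simulated data (hypothetical data-generating process with a true ATE of 2; the outcome model is set to the wrong constant 0 while the true propensity is used):

```r
# Double robustness: wrong outcome model, correct propensity score
set.seed(42)
n <- 10000
x <- rnorm(n)
e <- plogis(x)                 # true propensity score
w <- rbinom(n, 1, e)
y <- x + 2 * w + rnorm(n)      # true ATE = 2

# Outcome model deliberately misspecified as mu1 = mu0 = 0;
# the correct propensity keeps the AIPW score unbiased for tau
psi <- (0 - 0) + w * (y - 0) / e - (1 - w) * (y - 0) / (1 - e)
tau.wrong.mu <- mean(psi)      # still close to 2, though with higher variance
```

The symmetric experiment, a correct outcome model with a misspecified propensity score, also recovers the truth; only when both nuisance estimates are wrong does the estimator break down.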

#######################
#
# Augmented IPW (aipw)
#
#######################

#n_features  <- length(setdiff(names(dat2), "Y"))
# ranger
#funrf_ranger  <- function(dat){
#    rf2  <- ranger(
#        Y ~ ., 
#        data = dat, 
#        mtry = min(ceiling(sqrt(n_features) + 20), n_features),  
#        respect.unordered.factors = "order", 
#        seed = 123, 
#        num.trees = 2000
#    )
#    return(rf2)
#}

# storing
predict.mat2a  <- matrix(0, nrow = nrow(index), ncol = K)
predict.mat2b  <- predict.mat2a
score  <- vector("list", K)
aipwK  <- rep(0, K)
weightaipK  <- rep(nrow(index) / length(index), K)

for(i in seq_len(K)){
    # E(Y | X, W = 1) using cross-fitting
    predict.mat2a[, i]  <- fun.rf.grf(X = cbind(X[c(index[, -i]), ], W[index[, -i]]), W = Y[index[, -i]], 
                                    predictkfold = cbind(X[c(index[, i]), ], 1))
    # E(Y | X, W = 0) using cross-fitting
    predict.mat2b[, i]  <- fun.rf.grf(X = cbind(X[c(index[, -i]), ], W[index[, -i]]), W = Y[index[, -i]], 
                                    predictkfold = cbind(X[c(index[, i]), ], 0))
    # inverse-propensity-weighted residuals; keep the minus operator at the end
    # of the line so R does not terminate the expression early
    noise  <- ((W[index[, i]] * (Y[index[, i]] - predict.mat2a[, i])) / predict.mat[, i]) -
              (((1 - W[index[, i]]) * (Y[index[, i]] - predict.mat2b[, i])) / (1 - predict.mat[, i]))
    score[[i]]  <- predict.mat2a[, i] - predict.mat2b[, i] + noise
    aipwK[i]  <- mean(score[[i]])
}

aipw.grf  <- weighted.mean(aipwK, w = weightaipK)   # the weight argument is named w
se.aipw   <- sd(unlist(score)) / sqrt(length(unlist(score)))
ll  <- aipw.grf - se.aipw * qnorm(1 - alpha/2)
ul  <- aipw.grf + se.aipw * qnorm(1 - alpha/2)

result.aipw  <- c("AIPW Est." = round(aipw.grf, 3), "se" = round(se.aipw, 3), 
                    "lower bound" = round(ll, 3), "upper bound" = round(ul, 3))

######################
# grf
######################

# Train a causal forest 
tau.forest  <-  causal_forest(X, Y, W)

# Estimate the average treatment effect on the full sample (ATE) and on the treated (ATT).
grf_ate  <- average_treatment_effect(tau.forest, target.sample = "all")
grf_att  <- average_treatment_effect(tau.forest, target.sample = "treated")


##  PRINT ALL

#print(paste0("average treatment effect is: ", round(mean(pmax(X[, 1], 0)), 3)))
print(paste("treatment effects according to naive estimator:", round(mean(Y[which(W == 1)]) - mean(Y[which( W == 0)]), 3), sep = " "))
## [1] "treatment effects according to naive estimator: 15.558"
print(paste("treatment effects according to IPW using", K, "fold cross-fitting:", round(ipw.grf, 3), sep = " "))
## [1] "treatment effects according to IPW using 10 fold cross-fitting: 15.508"
print(paste("treatment effects according to IPW oracle:", round(ipw.oracle, 3), sep = " "))
## [1] "treatment effects according to IPW oracle: 16.496"
print(paste("treatment effects according to AIPW using", K, "fold cross-fitting:", round(aipw.grf, 3), sep = " "))
## [1] "treatment effects according to AIPW using 10 fold cross-fitting: 15.037"
print(paste("treatment effects according to GRF:", round(grf_ate[[1]], 3), sep = " "))
## [1] "treatment effects according to GRF: 15.031"
print(result.ipw)
## IPW estimate           se  lower bound  upper bound 
##       15.508        3.092        9.448       21.568
print(result.aipw)
##   AIPW Est.          se lower bound upper bound 
##      15.037       0.031      14.975      15.098
print(grf_ate)
##    estimate     std.err 
## 15.03144837  0.05209014

  1. This means that \(\hat{\mu}(\cdot)\) converges to \(\mu(\cdot)\) at the rate \(n^{-1/2}\). ↩︎