5.9 AIPW and Estimation
Augmented Inverse Probability Weighting (AIPW) provides a robust way to estimate the ATE by alleviating the limitations of the IPW estimator.
Following the IPW approach, the estimate of the ATE is given in equation (6). Another way to estimate \(\tau\) is through the conditional response. Write \(\mu_{w}(x) = E[Y_i| \; X_i = x, W_i = w]\). Then:
\(\tau(x) = E[Y_i| \; X_i = x, W_i = 1] - E[Y_i| \; X_i = x, W_i = 0]\)
This is the regression outcome approach, where \(\tau = E[\mu_{1}(X_i) - \mu_{0}(X_i)]\). A consistent estimator can be formed as \(\hat{\tau} = N^{-1} \sum_{i = 1}^{N} (\hat{\mu}_{1}(X_i) - \hat{\mu}_{0}(X_i))\).
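As a quick illustration of the regression outcome estimator, the sketch below fits \(E[Y| \; X, W]\) with a linear model on simulated data. The data-generating process and all names here are illustrative assumptions, not the chapter's setup:

```r
# Regression outcome (plug-in) estimator of the ATE on simulated data.
set.seed(123)
n <- 5000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(0.5 * x))   # treatment depends on x (confounding)
y <- 2 * w + x + rnorm(n)            # true ATE = 2

fit <- lm(y ~ w + x)                 # model for E[Y | X, W]
mu1 <- predict(fit, newdata = data.frame(w = 1, x = x))
mu0 <- predict(fit, newdata = data.frame(w = 0, x = x))
tau.reg <- mean(mu1 - mu0)           # recovers the ATE (about 2)
```

Each unit is predicted under both treatment states, and the ATE is the average of the individual differences.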
The AIPW approach combines the IPW approach with the regression outcome approach to estimate \(\tau\):
\(\hat{\tau}_{AIPW} = \frac{1}{N} \sum_{i = 1}^{N} \left(\hat{\mu}_{1}(X_i) - \hat{\mu}_{0}(X_i) + \frac{W_i \, (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1-W_i) \, (Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)}\right)\)
An ML approach with cross-fitting is used to estimate both \(\hat{e}(x)\) and \(\hat{\mu}_{w}(x)\). Following the cross-fitting structure, we can formally write the estimator of \(\tau\) as:
\(\hat{\tau}_{AIPW} = \frac{1}{N} \sum_{i = 1}^{N} \left(\underbrace{\hat{\mu}_{1}^{-k(i)}(X_i) - \hat{\mu}_{0}^{-k(i)}(X_i)}_{\text{consistent estimate of } \tau} + \frac{W_i \, (Y_i - \hat{\mu}_1^{-k(i)}(X_i))}{\hat{e}^{-k(i)}(X_i)} - \frac{(1-W_i) \, (Y_i - \hat{\mu}_0^{-k(i)}(X_i))}{1 - \hat{e}^{-k(i)}(X_i)}\right)\)
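The cross-fitted estimator can be sketched end-to-end on simulated data. Here logistic and linear regressions stand in for the chapter's random-forest learner `fun.rf.grf`, and the data and names are illustrative assumptions:

```r
# Cross-fitted AIPW with K folds: nuisances for fold k are fit on the other folds.
set.seed(123)
n <- 5000; K <- 5
x <- rnorm(n)
w <- rbinom(n, 1, plogis(0.5 * x))
y <- 2 * w + x + rnorm(n)                     # true ATE = 2
fold  <- sample(rep(1:K, length.out = n))     # k(i): fold of unit i
score <- numeric(n)
for (k in 1:K) {
  tr <- fold != k
  te <- fold == k
  e.fit  <- glm(w ~ x, family = binomial, subset = tr)   # propensity, fit off-fold
  mu.fit <- lm(y ~ w + x, subset = tr)                   # outcome model, fit off-fold
  e.hat <- predict(e.fit, newdata = data.frame(x = x[te]), type = "response")
  mu1   <- predict(mu.fit, newdata = data.frame(w = 1, x = x[te]))
  mu0   <- predict(mu.fit, newdata = data.frame(w = 0, x = x[te]))
  score[te] <- mu1 - mu0 +
    w[te] * (y[te] - mu1) / e.hat -
    (1 - w[te]) * (y[te] - mu0) / (1 - e.hat)
}
tau.aipw <- mean(score)                       # point estimate, close to 2
se.aipw  <- sd(score) / sqrt(n)               # plug-in standard error
```

Each unit's score uses only nuisance fits that never saw that unit, which is what justifies treating the scores as (approximately) independent draws when forming the standard error.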
The AIPW approach can be thought of as estimating the ATE by taking the difference across conditional responses and then adjusting the residuals using weights given by the propensity score. The AIPW estimator has two attractive features. First, \(\hat{\tau}_{AIPW}\) is consistent as long as either \(\hat{e}(x)\) or \(\hat{\mu}_{w}(x)\) is consistent, because \(E[Y_i - \hat{\mu}_{W_i}(X_i)] \approx 0\); this is the double robustness property. Second, \(\hat{\tau}_{AIPW}\) is a good approximation to the oracle \(\hat{\tau}_{AIPW}^{*}\) as long as \(\hat{\mu}(.)\) and \(\hat{e}(.)\) are reasonably accurate: high accuracy in one estimate can compensate for lack of accuracy in the other. If both \(\hat{\mu}(.)\) and \(\hat{e}(.)\) are \(\sqrt{n}\)-consistent7, then the following holds.
\(\sqrt{n}(\hat{\tau}_{AIPW} - \hat{\tau}_{AIPW}^{*}) \rightarrow_p 0\).
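A small simulation can illustrate the double robustness claim: the outcome model below deliberately omits the confounder, yet AIPW with a correctly specified propensity model stays near the true effect while the plain regression estimate does not. All data and names are illustrative assumptions:

```r
# Double robustness: wrong outcome model + correct propensity model.
set.seed(321)
n <- 10000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(x))                     # propensity e(x) = plogis(x)
y <- 2 * w + 3 * x + rnorm(n)                    # true ATE = 2, x confounds

mu.fit <- lm(y ~ w)                              # misspecified: omits x entirely
e.hat  <- fitted(glm(w ~ x, family = binomial))  # correctly specified propensity
mu1 <- predict(mu.fit, data.frame(w = rep(1, n)))
mu0 <- predict(mu.fit, data.frame(w = rep(0, n)))

tau.reg  <- mean(mu1 - mu0)                      # biased upward by the omitted confounder
tau.aipw <- mean(mu1 - mu0 +
                 w * (y - mu1) / e.hat -
                 (1 - w) * (y - mu0) / (1 - e.hat))  # close to 2
```

The propensity-weighted residual terms repair the bias of the misspecified outcome model; the symmetric case (wrong propensity, correct outcome model) works analogously.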
#######################
#
# Augmented IPW (aipw)
#
#######################
#n_features2 <- length(setdiff(names(dat2), "Y"))
# ranger
#funrf_ranger <- function(dat){
# rf2 <- ranger(
# Y ~ .,
# data = dat,
# mtry = min(ceiling(sqrt(n_features) + 20), n_features),
# respect.unordered.factors = "order",
# seed = 123,
# num.trees = 2000
# )
# return(rf2)
#}
# storing
predict.mat2a <- matrix(0, nrow = nrow(index), ncol = K)
predict.mat2b <- predict.mat2a
aipwK <- rep(0, K)
weightaipK <- rep(nrow(index) / length(index), K)
score <- list()  # per-fold AIPW scores
for(i in 1:K){
  # E(Y | X, W = 1) using cross-fitting
  predict.mat2a[, i] <- fun.rf.grf(X = cbind(X[c(index[, -i]), ], W[index[, -i]]), W = Y[index[, -i]],
                                   predictkfold = cbind(X[c(index[, i]), ], 1))
  # E(Y | X, W = 0) using cross-fitting
  predict.mat2b[, i] <- fun.rf.grf(X = cbind(X[c(index[, -i]), ], W[index[, -i]]), W = Y[index[, -i]],
                                   predictkfold = cbind(X[c(index[, i]), ], 0))
  # AIPW correction term: propensity-weighted residuals
  noise <- ((W[index[, i]] * (Y[index[, i]] - predict.mat2a[, i])) / predict.mat[, i]) -
    (((1 - W[index[, i]]) * (Y[index[, i]] - predict.mat2b[, i])) / (1 - predict.mat[, i]))
  score[[i]] <- predict.mat2a[, i] - predict.mat2b[, i] + noise
  aipwK[i] <- mean(score[[i]])
}
aipw.grf <- weighted.mean(aipwK, w = weightaipK)
sd.aipw <- sd(unlist(score))
ll <- aipw.grf - (sd.aipw / sqrt(length(unlist(score)))) * qnorm(1 - alpha/2)
ul <- aipw.grf + (sd.aipw / sqrt(length(unlist(score)))) * qnorm(1 - alpha/2)
result.aipw <- c("AIPW Est." = round(aipw.grf, 3), "se" = round(sd.aipw / sqrt(length(W)), 3),
                 "lower bound" = round(ll, 3), "upper bound" = round(ul, 3))
######################
# grf
######################
# Train a causal forest
tau.forest <- causal_forest(X, Y, W)
# Estimate the average treatment effect on the full sample (ATE) and on the treated (ATT)
grf_ate <- average_treatment_effect(tau.forest, target.sample = "all")
grf_att <- average_treatment_effect(tau.forest, target.sample = "treated")
## PRINT ALL
#print(paste0("average treatment effect is: ", round(mean(pmax(X[, 1], 0)), 3)))
print(paste("treatment effects according to naive estimator:", round(mean(Y[which(W == 1)]) - mean(Y[which( W == 0)]), 3), sep = " "))
## [1] "treatment effects according to naive estimator: 15.558"
print(paste("treatment effects according to IPW using", K, "fold cross-fitting:", round(ipw.grf, 3), sep = " "))
## [1] "treatment effects according to IPW using 10 fold cross-fitting: 15.508"
print(paste("treatment effects according to IPW oracle:", round(ipw.oracle, 3), sep = " "))
## [1] "treatment effects according to IPW oracle: 16.496"
print(paste("treatment effects according to AIPW using", K, "fold cross-fitting:", round(aipw.grf, 3), sep = " "))
## [1] "treatment effects according to AIPW using 10 fold cross-fitting: 15.037"
print(paste("treatment effects according to GRF:", round(grf_ate[[1]], 3), sep = " "))
## [1] "treatment effects according to GRF: 15.031"
print(result.ipw)
## IPW estimate se lower bound upper bound
## 15.508 3.092 9.448 21.568
print(result.aipw)
## AIPW Est. se lower bound upper bound
## 15.037 0.031 14.975 15.098
print(grf_ate)
## estimate std.err
## 15.03144837 0.05209014
This means that \(\hat{\mu}(.)\) and \(\hat{e}(.)\) converge to \(\mu(.)\) and \(e(.)\) at the \(1/\sqrt{n}\) rate.↩︎