Matching, Regression Discontinuity, Difference in Differences, and Beyond

Myoung-jae Lee

Print publication date: 2016

Print ISBN-13: 9780190258733

Published to British Academy Scholarship Online: August 2016

DOI: 10.1093/acprof:oso/9780190258733.001.0001


(p.209) A Appendix

This appendix examines prerequisite or supplementary topics, ordered in the way they are needed if the main text chapters are read sequentially. First, kernel nonparametric estimation is reviewed, the knowledge of which is helpful for matching and RD. Second, bootstrap is introduced, which will be useful for most estimators including matching estimators. Third, unobserved confounders and treatment endogeneity are discussed, which will be needed mostly for RD; although these are in the appendix because they do not fit the main theme of this book, their importance cannot be overstated. Fourth, “left-over” topics for DD are discussed.

A.1 Kernel Density and Regression Estimators

A.1.1 Histogram-Type Density Estimator

Suppose $X_1,\ldots,X_N$ are iid random variables with df $F(x)$ and density function $f(x)$ that is bounded and continuously differentiable. Consider a $B(N,\pi)$ random variable (binomial with $N$ trials and success probability $\pi$) where $h>0$ is a bandwidth:

$$\sum_{i=1}^N 1[\,|X_i-x|<h\,] \quad\text{with}\quad \pi \equiv P(|X-x|<h) = P(x-h < X < x+h).$$

It holds that

$$E\Big(\sum_{i=1}^N 1[\,|X_i-x|<h\,]\Big) = N\pi \quad\text{and}\quad V\Big(\sum_{i=1}^N 1[\,|X_i-x|<h\,]\Big) = N\pi(1-\pi).$$

A histogram-type density estimator for $f(x)$ with interval size $2h$ is

$$\bar f(x) \equiv \frac{1}{N}\sum_{i=1}^N \frac{1}{2h}\,1[\,|X_i-x|<h\,] = \frac{1}{Nh}\sum_{i=1}^N \frac{1}{2}\,1\Big[\frac{|X_i-x|}{h}<1\Big].$$

Without $2h$ in the denominator, $\bar f(x)$ would be a histogram showing the proportion of observations falling in $x \pm h$. Since $\bar f(x)$ is an estimator for $f(x)$ without parameterizing $f(x)$ as normal, logistic, etc., $\bar f(x)$ is a 'nonparametric estimator' for $f(x)$. For instance, if we know $X \sim N(\mu,\sigma^2)$ with unknown parameters $\mu$ and $\sigma^2$, then $f(x)$ (p.210) can be estimated parametrically with

$$\frac{1}{s_N\sqrt{2\pi}}\exp\Big\{-\frac{1}{2}\Big(\frac{x-\bar X}{s_N}\Big)^2\Big\}, \quad\text{where } \bar X \equiv \frac{1}{N}\sum_i X_i \ \text{ and } \ s_N^2 \equiv \frac{1}{N}\sum_i (X_i-\bar X)^2.$$

In the following, we show $\bar f(x)\to_p f(x)$; see, for example, Lee (2010a) for more on nonparametric density and regression estimation. Be mindful of the distinction between the data $X_1,\ldots,X_N$ and the evaluation point $x$, which does not have to equal any of $X_1,\ldots,X_N$.

Using the mean and variance for $\sum_i 1[\,|X_i-x|<h\,]$, we get

$$E\{\bar f(x)\} = \frac{N\pi}{N\cdot 2h} = \frac{\pi}{2h} \quad\text{and}\quad V\{\bar f(x)\} = \frac{\pi(1-\pi)}{4Nh^2} = \frac{\pi}{2h}\cdot\frac{1-\pi}{2Nh}.$$

With $\pi = F(x+h) - F(x-h)$, a Taylor expansion yields, for points $x^*$ and $x^{**}$ between $x-h$ and $x+h$ and with $O(h^2)$ essentially denoting a constant times $h^2$,

$$\pi = \Big\{F(x) + f(x)h + f'(x^*)\frac{h^2}{2}\Big\} - \Big\{F(x) - f(x)h + f'(x^{**})\frac{h^2}{2}\Big\} = 2f(x)h + O(h^2) \ \Longrightarrow\ \frac{\pi}{2h} = f(x) + O(h).$$

Substitute $\pi/(2h) = f(x) + O(h)$ into the above $E\{\bar f(x)\}$ and $V\{\bar f(x)\}$ to obtain

$$E\{\bar f(x)\} = f(x) + O(h) \quad\text{and}\quad V\{\bar f(x)\} = \{f(x)+O(h)\}\,\frac{1-\pi}{2Nh}.$$

Let $h\to 0$ and $Nh\to\infty$ as $N\to\infty$. Then, as $N\to\infty$,

$$E\{\bar f(x)\}\to f(x) \quad\text{and}\quad V\{\bar f(x)\}\to 0;$$

$h\to 0$ makes the bias $E\{\bar f(x)\} - f(x)$ disappear, whereas $Nh\to\infty$ makes the variance disappear. This implies $\bar f(x)\to_p f(x)$, because $\bar f(x) - E\{\bar f(x)\}\to_p 0$ due to an LLN and $E\{\bar f(x)\}\to f(x)$.

A.1.2 Kernel Density Estimator

The role of the indicator function $1[\,|X_i-x|/h<1\,]$ in $\bar f(x)$ is weighting the $i$th observation: the weight is 1 if $X_i$ falls within the $h$-distance from $x$, and 0 otherwise. Generalizing this weighting idea, we can think of a smooth weight depending on $|X_i-x|$. Let $X$ now be a $k\times 1$ vector; in this case, the interval of length $2h$ becomes a $k$-dimensional cube of volume $(2h)^k$; for example, the two-dimensional analog of the interval $(x-h, x+h)$ is the rectangle around $x$ of size $(2h)^2 = 4h^2$.

A 'kernel density estimator' is based on the smooth weighting idea:

$$\hat f(x) \equiv \frac{1}{Nh^k}\sum_{i=1}^N K\Big(\frac{X_i - x}{h}\Big),$$

where $K$ (called a 'kernel') is a smooth multivariate density that is symmetric about 0 (e.g., the $N(0, I_k)$ density). The kernel estimator $\hat f(x)$ includes $\bar f(x)$ as a special case when the 'uniform kernel' $K(z) = 1[\,|z|<1\,]/2$ is used with $k=1$.

(p.211) Analogous to the earlier proof for $\bar f(x)\to_p f(x)$, we can show that $\hat f(x)\to_p f(x)$. Furthermore, under some regularity conditions,

$$(Nh^k)^{1/2}\{\hat f(x) - f(x)\} \ \to_d\ N\Big\{0,\ f(x)\!\int\! K(z)^2\,dz\Big\},$$

which can be used to construct (point-wise) confidence intervals (CI) for $f(x)$. Although $\int K(z)\,dz = 1$, $\int K(z)^2\,dz \ne 1$ in general: we have to find $\int K(z)^2\,dz$ for CIs. One simple way to find $\int K(z)^2\,dz$ is a 'Monte Carlo integral': generating $Z_1,\ldots,Z_n$ iid $N(0,1)$, it holds that

$$\frac{1}{n}\sum_{i=1}^n \frac{K(Z_i)^2}{\phi(Z_i)} \ \to_p\ \int \frac{K(z)^2}{\phi(z)}\,\phi(z)\,dz = \int K(z)^2\,dz \quad\text{as } n\to\infty;$$

the pseudo-sample average (i.e., the first term) can be used for $\int K(z)^2\,dz$.
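As an illustration of this Monte Carlo integral, the following minimal Python sketch (assuming only numpy and scipy are available; the function name `quartic` is ours) approximates $\int K(z)^2\,dz$ for the quartic kernel, whose exact value is $5/7 \approx 0.714$.

```python
import numpy as np
from scipy.stats import norm

def quartic(z):
    # quartic (biweight) kernel: (15/16)(1 - z^2)^2 on |z| < 1, zero elsewhere
    return np.where(np.abs(z) < 1, (15 / 16) * (1 - z**2) ** 2, 0.0)

rng = np.random.default_rng(0)
n = 200_000
Z = rng.standard_normal(n)                   # Z_1,...,Z_n iid N(0,1)
mc = np.mean(quartic(Z) ** 2 / norm.pdf(Z))  # (1/n) sum K(Z_i)^2 / phi(Z_i)
print(mc)                                    # close to the exact value 5/7
```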

Other than the uniform kernel and the $N(0,1)$ kernel $\phi(\cdot) = (2\pi)^{-1/2}\exp\{-(\cdot)^2/2\}$, popular kernels are

$$\frac{3}{4}(1-z^2)\,1[\,|z|<1\,]: \ \text{(trimmed) quadratic kernel}, \qquad \frac{15}{16}(1-z^2)^2\,1[\,|z|<1\,]: \ \text{quartic or biweight kernel}.$$

For $k>1$, 'product kernels' consisting of copies of univariate kernels can be used; for example, $K(z) = \prod_{j=1}^k \phi(z_j)$.

The scalar $h$ is a 'bandwidth' or 'smoothing parameter', whose role is analogous to that of the interval size in a histogram. If $h$ is too small, there is no grouping (averaging) and $\hat f(x)$ will be too jagged as $x$ varies (a small bias but a large variance). If $h$ is too large, $\hat f(x)$ will show little variation (a large bias but a small variance). As with the histogram interval size, there is no single best rule for choosing $h$ in practice. When $k=1,2$, the best strategy is visual inspection: choose $h$ such that the graph $x \mapsto \hat f(x)$ is neither too jagged nor too smooth; if anything, slightly undersmooth.

A practical rule of thumb for choosing $h$ is $h \simeq \nu N^{-1/(k+4)}$ with, say, $0.5 \le \nu \le 3$ if $k$ is 1 or 2, with all components of $X$ standardized. For example, if $K(z) = \prod_{j=1}^k \phi(z_j)$ with $z = (z_1,\ldots,z_k)'$ is used, then

$$K\Big(\frac{X_i - x}{h}\Big) = \prod_{j=1}^k \phi\Big[\frac{X_{ji} - x_j}{SD(X_j)\,\nu N^{-1/(k+4)}}\Big] = \prod_{j=1}^k \frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{1}{2}\Big(\frac{X_{ji}-x_j}{SD(X_j)\,\nu N^{-1/(k+4)}}\Big)^2\Big\}.$$
More discussion on choosing K and h appears below.
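As a concrete illustration of the estimator and the rule of thumb, here is a minimal Python sketch (numpy assumed; `kernel_density` is an illustrative name, not one of the book's programs) of the product-Gaussian kernel density estimator with componentwise bandwidths $SD(X_j)\,\nu N^{-1/(k+4)}$.

```python
import numpy as np

def kernel_density(X, x, nu=1.0):
    """Product-Gaussian kernel density estimate f^(x) at one point x.
    X : (N, k) data matrix; x : (k,) evaluation point.
    Bandwidth per component: SD(X_j) * nu * N**(-1/(k+4))  (rule of thumb)."""
    X = np.asarray(X, dtype=float)
    N, k = X.shape
    h = X.std(axis=0) * nu * N ** (-1 / (k + 4))            # componentwise bandwidths
    Z = (X - x) / h                                          # standardized distances
    Kvals = np.prod(np.exp(-0.5 * Z**2) / np.sqrt(2 * np.pi), axis=1)
    return Kvals.sum() / (N * np.prod(h))

# example: bimodal data with k = 1
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)]).reshape(-1, 1)
print(kernel_density(data, np.array([0.0])))
```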

A.1.3 Kernel Regression Estimator

A kernel regression estimator $\hat\rho(x)$ for $\rho(x) \equiv E(Y|X=x)$ in

$$Y_i = \rho(X_i) + U_i \quad\text{with}\quad E(U|X) = 0 \ \Longleftrightarrow\ E(Y|X) = \rho(X)$$

(p.212) is

$$\hat\rho(x) \equiv \frac{(Nh^k)^{-1}\sum_{i=1}^N K((X_i-x)/h)\,Y_i}{(Nh^k)^{-1}\sum_{i=1}^N K((X_i-x)/h)} = \frac{\hat g(x)}{\hat f(x)},$$

where the numerator of $\hat\rho(x)$ is defined as $\hat g(x)$. Rewrite $\hat\rho(x)$ as

$$\sum_{i=1}^N \Big\{\frac{K((X_i-x)/h)}{\sum_{j=1}^N K((X_j-x)/h)}\Big\}\,Y_i$$

to see that $\hat\rho(x)$ is a weighted average of the $Y_i$'s, where the weight is large if $X_i$ is close to $x$ and small otherwise.

Similarly to $\hat f(x)\to_p f(x)$, it can be shown that

$$\hat g(x) \ \to_p\ E(Y|X=x)\,f(x),$$

which implies, when combined with $\hat f(x)\to_p f(x)$,

$$\hat\rho(x) \ \to_p\ \rho(x).$$

Analogously to the asymptotic normality of $(Nh^k)^{1/2}\{\hat f(x)-f(x)\}$, it holds under some regularity conditions that

$$(Nh^k)^{1/2}\{\hat\rho(x) - \rho(x)\} \ \to_d\ N\Big\{0,\ \frac{V(U|x)\int K(z)^2\,dz}{f(x)}\Big\}.$$

$V(U|x) = E(U^2|x)$ can be estimated using the residual $\hat U_i \equiv Y_i - \hat\rho(X_i)$:

$$\hat V(U|x) \equiv \frac{(Nh^k)^{-1}\sum_{i=1}^N K((X_i-x)/h)\,\hat U_i^2}{(Nh^k)^{-1}\sum_{i=1}^N K((X_i-x)/h)}.$$
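A minimal Python sketch of the kernel (local constant) regression estimator and the residual-based estimator of $V(U|x)$ may help fix ideas; a Gaussian product kernel is assumed and the function names are ours.

```python
import numpy as np

def gauss_kernel(u):
    # product N(0,1) kernel over the columns of u, u shape (N, k)
    return np.prod(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi), axis=1)

def nw_regression(X, Y, x, h):
    """Kernel (local constant) regression estimate rho^(x) = g^(x)/f^(x); X is (N, k)."""
    X = np.asarray(X, dtype=float)
    w = gauss_kernel((X - x) / h)              # K((X_i - x)/h)
    return np.sum(w * Y) / np.sum(w)

def nw_cond_var(X, Y, x, h):
    """Estimate V(U|x) by kernel-smoothing squared residuals U^_i = Y_i - rho^(X_i)."""
    X = np.asarray(X, dtype=float)
    resid2 = np.array([(Y[i] - nw_regression(X, Y, X[i], h))**2 for i in range(len(Y))])
    w = gauss_kernel((X - x) / h)
    return np.sum(w * resid2) / np.sum(w)

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 1)); Y = X[:, 0] - X[:, 0]**2 + rng.standard_normal(300)
print(nw_regression(X, Y, np.array([0.5]), h=0.3), nw_cond_var(X, Y, np.array([0.5]), h=0.3))
```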

To implement kernel estimation, one has to choose $K$ and $h$. As for $K$, it is known that the choice of kernel makes little difference, but the choice of $h$ makes a large difference. When $k=1$ or 2, other than the above rule of thumb, a good practical method is drawing $\hat f(x)$ or $\hat\rho(x)$ over a reasonable range of $x$ and choosing $h$ such that the curve estimate is neither too smooth (as when $h$ is too big) nor too jagged (as when $h$ is too small), as in choosing $h$ for density estimation. If $k>2$, this balancing act is hard to do. In this case, it is advisable to choose $h$ by the following scheme.

Consider minimizing for $h$:

$$\sum_i \{Y_i - \hat\rho_{-i}(X_i)\}^2 \quad\text{where}\quad \hat\rho_{-i}(X_i) \equiv \frac{\sum_{j=1,\,j\ne i}^N K((X_j-X_i)/h)\,Y_j}{\sum_{j=1,\,j\ne i}^N K((X_j-X_i)/h)};$$

$\hat\rho_{-i}(X_i)$ is a 'leave-one-out estimator' for $\rho(X_i)$. This method of choosing $h$ is called 'cross-validation (CV)', which works well in practice. For estimating $f(x)$, $Y_i$ is irrelevant, and a CV minimand for $h$ is

$$\frac{1}{N^2 h^k}\sum_i\sum_j K^{(2)}\Big(\frac{X_i-X_j}{h}\Big) - \frac{2}{N(N-1)h^k}\sum_i\sum_{j\ne i} K\Big(\frac{X_i-X_j}{h}\Big),$$

where $K^{(2)}(a) \equiv \int K(a-z)K(z)\,dz$.
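The regression CV minimand can be computed directly. Below is a minimal Python sketch (numpy assumed; the names are ours) that searches a grid of bandwidths for the leave-one-out criterion $\sum_i\{Y_i - \hat\rho_{-i}(X_i)\}^2$ with a Gaussian product kernel.

```python
import numpy as np

def cv_bandwidth_reg(X, Y, grid):
    """Leave-one-out cross-validation for the kernel-regression bandwidth h."""
    X = np.asarray(X, dtype=float)
    D = X[:, None, :] - X[None, :, :]              # pairwise differences X_i - X_j
    best_h, best_cv = None, np.inf
    for h in grid:
        K = np.prod(np.exp(-0.5 * (D / h)**2) / np.sqrt(2 * np.pi), axis=2)
        np.fill_diagonal(K, 0.0)                   # leave one out: drop j = i
        rho_loo = K @ Y / K.sum(axis=1)            # rho^_{-i}(X_i)
        cv = np.sum((Y - rho_loo)**2)
        if cv < best_cv:
            best_h, best_cv = h, cv
    return best_h

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 1)); Y = np.sin(2 * X[:, 0]) + 0.3 * rng.standard_normal(200)
print(cv_bandwidth_reg(X, Y, grid=np.linspace(0.05, 1.0, 20)))
```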

(p.213) A.1.4 Local Linear Regression

The kernel estimator $\hat\rho(x)$ can be obtained by minimizing the following with respect to (wrt) $a$:

$$\sum_i (Y_i - a)^2\,K\Big(\frac{X_i-x}{h}\Big);$$

$\hat\rho(x)$ may be viewed as predicting $Y_i$ locally around $x$ using only the intercept $a$. A variation of $\hat\rho(x)$ is obtained using a line (an intercept and a slope) centered at $x$, which is local linear regression (LLR) minimizing

$$\sum_i \{Y_i - a - b'(X_i - x)\}^2\,K\Big(\frac{X_i-x}{h}\Big)$$

wrt $a$ and $b$. The intercept estimator $\hat a(x)$ for $a$ is the LLR estimator for $\rho(x)$, whereas the slope estimator $\hat b(x)$ for $b$ is the LLR estimator for the derivative $\partial\rho(x)/\partial x$. See Fan (1996) for LLR in general.

To be specific, the LLR estimator for $\rho(x)$ is ($0_{1\times k}$ is the $1\times k$ null vector)

$$\hat a(x) = (1, 0_{1\times k})\,\{X(x)'W(x)X(x)\}^{-1}\{X(x)'W(x)Y\},$$

where $Y \equiv (Y_1,\ldots,Y_N)'$, $W(x) \equiv \mathrm{diag}\{K((X_1-x)/h),\ldots,K((X_N-x)/h)\}$, and

$$\underset{N\times(1+k)}{X(x)} \equiv \begin{bmatrix} 1 & (X_1-x)' \\ \vdots & \vdots \\ 1 & (X_N-x)' \end{bmatrix}.$$

Compared with the LLR estimator $\hat a(x)$, the usual kernel estimator $\hat\rho(x)$ may be called the 'local constant regression (LCR)' estimator. Relatively speaking, LLR is less biased but more variable than LCR; this is the classic trade-off between bias and variance. The advantage of LLR in being less biased tends to be visible around boundary points of the support of $X$ and at peaks and troughs of $E(Y|X=x)$.
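A minimal Python sketch of LLR at a point $x$, following the matrix formula above with a Gaussian product kernel (numpy assumed; `llr` is an illustrative name), is:

```python
import numpy as np

def llr(X, Y, x, h):
    """Local linear regression at x: returns (rho^(x), slope estimate b^(x))."""
    X = np.asarray(X, dtype=float)
    N, k = X.shape
    Xc = np.hstack([np.ones((N, 1)), X - x])                  # rows [1, (X_i - x)']
    w = np.prod(np.exp(-0.5 * ((X - x) / h)**2) / np.sqrt(2 * np.pi), axis=1)
    WX = Xc * w[:, None]                                      # W(x) X(x)
    coef = np.linalg.solve(Xc.T @ WX, Xc.T @ (w * Y))         # {X'WX}^{-1} X'WY
    return coef[0], coef[1:]                                  # a^(x), b^(x)

# example: the data-generating process used for Figure A.1 below
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 1)); Y = X[:, 0] - X[:, 0]**2 + rng.standard_normal(200)
h = 0.5 * X.std() * 200 ** (-1 / 5)
print(llr(X, Y, np.array([0.5]), h))
```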

In Figure A.1 with $N=200$, we generated $Y$ with $Y = X - X^2 + U$ where $U, X \sim N(0,1)$ with $U ⨿ X$, and LCR and LLR estimates were obtained with $K = \phi$ (the $N(0,1)$ kernel). In the left panel, we set $h = 0.5\times SD(X)N^{-1/5}$; in the right panel, $h = 2.5\times SD(X)N^{-1/5}$, five times the bandwidth of the left panel. The solid lines are the true regression function $E(Y|X) = X - X^2$, whereas the dashed and dotted lines are the LCR and LLR estimates, respectively. In the left panel, LCR and LLR are almost the same, and both are undersmoothed in view of the wiggly parts. In the right panel, both are oversmoothed with the larger bandwidth, but LCR is more biased than LLR as LCR clearly oversmooths the peak.

A.2 Bootstrap

This section reviews bootstrap, drawing on Lee (2010a) who in turn drew on Hall (1992), Efron and Tibshirani (1993), Shao and Tu (1995), (p.214) Davison and Hinkley (1997), Horowitz (2001), and Efron (2003). See also Van der Vaart (1998), Lehmann and Romano (2005), and DasGupta (2008). In the main text, we mentioned ‘nonparametric (or empirical) bootstrap’ many times to simplify asymptotic inference. Hence, before embarking on the review of bootstrap in general, we quickly explain nonparametric bootstrap in the following.

Figure A.1 $E(Y|X)$ vs. LCR and LLR Estimates

Given an original sample of size $N$ and an estimate $\hat\alpha_N$ for a parameter $\alpha$, (i) resample from the original sample with replacement to construct a pseudo sample of size $N$; (ii) apply the same estimation procedure to the pseudo sample to get a pseudo estimate $\hat\alpha_N^b$; (iii) repeat this $B$ times (e.g., $B=500$—the higher the better) to obtain $\hat\alpha_N^b$, $b=1,\ldots,B$; (iv) use quantiles for each component of the $\hat\alpha_N^b$'s to construct a confidence interval (CI) for the corresponding component of $\alpha$; for example, the 0.025 and 0.975 quantiles of the second components of $(\hat\alpha_N^1,\ldots,\hat\alpha_N^B)$ give a 95% CI for the second component of $\alpha$.

Instead of CIs, sometimes the variance estimator $B^{-1}\sum_{b=1}^B (\hat\alpha_N^b - \hat\alpha_N)(\hat\alpha_N^b - \hat\alpha_N)'$ is used as an asymptotic variance estimate for $\hat\alpha_N$. Although CIs from the bootstrap are consistent as long as the estimation procedure is "smooth," the consistency of the variance estimator is not known in general.

In the online appendix, the program 'BootAvgSim' illustrates how to do nonparametric bootstrap (as well as the 'bootstrap percentile-t method' to be explained below) for a mean. The program 'RegImpPsNprSim' shows how to implement nonparametric bootstrap in the regression imputation approach, which can be easily modified for other approaches' bootstrap.
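To make steps (i)–(iv) concrete, here is a minimal Python sketch of nonparametric bootstrap for a generic scalar estimator (numpy assumed; the function name is ours and is not the book's 'BootAvgSim'), returning the percentile CI and the bootstrap variance mentioned above.

```python
import numpy as np

def np_bootstrap(data, estimator, B=500, alpha=0.05, seed=0):
    """Nonparametric bootstrap: percentile CI and variance estimate for a scalar estimator."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    N = len(data)
    est = estimator(data)
    pseudo = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)       # resample with replacement
        pseudo[b] = estimator(data[idx])       # same procedure on the pseudo sample
    lo, hi = np.quantile(pseudo, [alpha / 2, 1 - alpha / 2])
    var = np.mean((pseudo - est) ** 2)         # bootstrap variance estimate
    return est, (lo, hi), var

# example: bootstrap percentile CI for a mean
rng = np.random.default_rng(5)
y = rng.exponential(scale=2.0, size=300)
print(np_bootstrap(y, np.mean))
```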

(p.215) A.2.1 Review on Usual Asymptotic Inference

Statistical inference is conducted with CIs and hypothesis tests (HT). For a $k\times 1$ parameter $\beta$ and an estimator $b_N \to_p \beta$, CI and HT are done using the asymptotic distribution of a transformation of $b_N$: in most cases, for some variance $V$,

$$\sqrt N (b_N - \beta) \ \to_d\ N(0, V) \ \Longleftrightarrow\ \sqrt N\, V^{-1/2}(b_N - \beta) \ \to_d\ N(0, I_k).$$

The test statistic (TS) $\sqrt N\, V^{-1/2}(b_N - \beta)$ is asymptotically pivotal because its asymptotic distribution is a known distribution, namely $N(0, I_k)$.

To do inference with a CI, note $\sqrt N(t'b_N - t'\beta) \to_d N(0, t'Vt)$ for a known $k\times 1$ vector $t$. With $\zeta_\alpha$ denoting the $\alpha$ quantile of $N(0,1)$ and $V_N \to_p V$, as $N\to\infty$,

$$P\Big\{-\zeta_{1-\alpha/2} < \frac{\sqrt N(t'b_N - t'\beta)}{\sqrt{t'V_N t}} < \zeta_{1-\alpha/2}\Big\} \ \to\ P\{-\zeta_{1-\alpha/2} < N(0,1) < \zeta_{1-\alpha/2}\} = 1-\alpha$$
$$\Longrightarrow\ P\Big\{t'b_N - \zeta_{1-\alpha/2}\sqrt{\frac{t'V_N t}{N}} \ <\ t'\beta\ <\ t'b_N + \zeta_{1-\alpha/2}\sqrt{\frac{t'V_N t}{N}}\Big\} \ \to\ 1-\alpha.$$

This gives a CI for $t'\beta$; for example, $t = (0,\ldots,0,1)'$ and $\alpha = 0.05$ yields a symmetric asymptotic 95% CI for $\beta_k$. For $H_0: t'\beta = c$ with a specified value $c$ (typically $c=0$), we reject the $H_0$ if $c$ is not captured by the CI. The false rejection probability (i.e., the type I error) is $\alpha$.

Alternatively to using a CI, we can use an asymptotically pivotal TS to conduct a HT: if the realized value of the TS is "extreme" for the known asymptotic distribution under $H_0$, then the $H_0$ is rejected. For instance, under $H_0: t'\beta = c$, we can use

$$\frac{\sqrt N(t'b_N - c)}{\sqrt{t'V_N t}} \ \to_d\ N(0,1), \quad\text{where the unknown } t'\beta \text{ is replaced by } c \text{ in } H_0.$$

For two-sided tests, we choose the critical region $(-\infty, -\zeta_{1-\alpha/2})$ and $(\zeta_{1-\alpha/2}, \infty)$, and reject $H_0$ if the realized value of the TS falls in the critical region (with false rejection probability $\alpha$). A better way might be looking at the p-value

$$2\times P\Big\{N(0,1) > \Big|\,\text{realized value of } \frac{\sqrt N(t'b_N - c)}{\sqrt{t'V_N t}}\,\Big|\Big\}$$

and rejecting the $H_0$ if the p-value is smaller than $\alpha$. For one-sided tests, this HT scenario requires minor modifications.

Although CI and HT are equivalent to (i.e., "dual" to) each other in the case of using $\sqrt N(b_N - \beta) \to_d N(0, V)$, there are many HTs whose corresponding CIs are hard to think of. For instance, $H_0$: the distribution of $Y$ is symmetric about 0, or $H_0: E(Y^4) = 3E(Y^2)$.

(p.216) A.2.2 Bootstrap to Find Quantiles

Define the exact distribution function (df) for a statistic $T_N(F)$:

$$G_N(c; F) \equiv P\{T_N(F) \le c\}, \quad\text{where}\quad T_N(F) \equiv V_N(F)^{-1/2}\sqrt N\,\{b_N(F) - \beta(F)\},$$

where $F$ is the distribution for the original sample and $V_N$ is a 'scaling number (matrix)'. Regard $\beta$ as a scalar for simplification. Keep in mind the distinction between a (probability) distribution and its df; a df is just a deterministic function.

We desire $G_N(c; F)$: how $T_N(F)$ behaves with a given sample of size $N$ when the sample is drawn from the true distribution $F$. The last display makes it explicit that the exact, not asymptotic, distribution of $T_N(F)$ depends on the underlying distribution $F$. The usual large-sample inference in the preceding section uses the approximation (the 'asymptotic df' of $T_N(F)$) for $G_N(c; F)$:

$$G(c; F) \equiv \lim_{N\to\infty} G_N(c; F).$$

Often $T_N(F)$ is asymptotically pivotal: $G(c; F)$ does not depend on $F$; for example, $G(c; F) = P\{N(0, I_k) \le c\}$. We may then write just $G(c)$ instead of $G(c; F)$. In this case, the large-sample approximation $G(c; F)$ to $G_N(c; F)$ is done only through one route ("through the subscript"). "Two-route" approximation is shown next.

Suppose $T_N(F)$ is not asymptotically pivotal; for example, $G(c; F) = \Phi\{c/\sigma(F)\}$ where the parameter of interest is the mean and $\sigma(F)$ is the SD. In this nonpivotal case, the nuisance parameter $\sigma(F)$ should be replaced by an estimator, say, $s_N \equiv \sigma(F_N)$. In a case like this with an asymptotically nonpivotal $T_N(F)$, $G(c; F_N)$ is used as a large-sample approximation for $G_N(c; F)$ due to the estimated nuisance parameter: two routes of approximation are taken between $G_N(c; F)$ and $G(c; F_N)$, through the subscript and through $F_N$.

Suppose that $G_N(c; F)$ is smooth in $F$ in the sense

$$G_N(c; F_N) - G_N(c; F) \ \to_p\ 0 \quad\text{as } N\to\infty, \quad\text{where } F_N \text{ is the empirical distribution for } F;$$

recall that the empirical distribution $F_N$ gives probability $N^{-1}$ to each observation $Z_i$, $i=1,\ldots,N$. Bootstrap uses $G_N(c; F_N)$ as an approximation to $G_N(c; F)$, where the approximation is done only through $F_N$. This is in contrast to the large-sample approximation $G(c)$ or $G(c; F_N)$ to $G_N(c; F)$.

Whether the last display holds depends on the smoothness of $G_N(c; F)$ as a functional of $F$. This also shows that consistent estimators for $F$ other than $F_N$ (e.g., a smoothed version of $F_N$) may be used in place of $F_N$. This is the basic bootstrap idea: replace $F$ with $F_N$ and do the same thing as with $F$. Since the smoothness of $G_N(c; F)$ is the key ingredient for bootstrap, if the "source" $T_N(F)$ is not smooth in $F$, bootstrap either will not work as well (e.g., quantile regression is "one degree" less smooth than LSE, and bootstrap works for quantile regression in a weaker sense than for LSE) or does not work at all. Bear in mind the different versions of $G$ that have appeared so far:

                  Nonoperational             Operational
Finite sample     $G_N(c;F)$ for target      $G_N(c;F_N)$ in bootstrap
Asymptotic        $G(c;F)$                   $G(c)$ (pivotal); $G(c;F_N)$ (nonpivotal)

(p.217) Using $G_N(c; F_N)$ means treating the original sample $(Z_1,\ldots,Z_N)$ as the population—that is, the "population distribution" is multinomial with $P(Z = Z_i) = N^{-1}$. Specifically, with $F$ replaced by $F_N$, we have

$$G_N(c; F_N) = P\{T_N(F_N) \le c\} = P\big[V_N(F_N)^{-1/2}\sqrt N\,\{b_N(F_N) - \beta(F_N)\} \le c\big]$$

and $\beta(F_N)$ is the parameter for the empirical distribution. For instance, suppose $\beta(F) = E(Z) = \int z\,dF(z)$ and the estimator for $\beta$ is the sample mean $b_N = \bar Z$. Considering a pseudo sample $Z_1^*,\ldots,Z_N^*$ drawn from $F_N$ with replacement—some observations in the original sample get drawn multiple times while some are never drawn—we have

$$\beta(F_N) = \int z\,dF_N(z) = \frac{1}{N}\sum_i Z_i = \bar Z \quad (F_N \text{ assigns weight } 1/N \text{ to each support point } Z_i),$$
$$b_N(F_N) = \bar Z^* \equiv \frac{1}{N}\sum_i Z_i^* \quad (\text{the pseudo-sample mean, estimating the parameter } \beta(F_N) = \bar Z),$$
$$V(F_N) = \frac{1}{N}\sum_i Z_i^2 - \bar Z^2 = \frac{1}{N}\sum_i (Z_i - \bar Z)^2 \quad (\text{which is also the sample variance } V_N(F)),$$
$$V_N(F_N) = \frac{1}{N}\sum_i Z_i^{*2} - \bar Z^{*2} = \frac{1}{N}\sum_i (Z_i^* - \bar Z^*)^2 \quad (\text{the pseudo-sample variance, estimating } V(F_N)).$$

This example illustrates that bootstrap approximates the distribution of (scaled) $\bar Z - E(Z)$ with that of (scaled) $\bar Z^* - \bar Z$. That is, the relationship of $b_N = \bar Z$ to $\beta = E(Z)$ is inferred from that of $b_N^* = \bar Z^*$ to $b_N = \bar Z$.

$G_N(c; F_N)$ may look hard to get, but it can be estimated as precisely as desired because $F_N$ is known. One pseudo sample of size $N$ gives one realization of $T_N(F_N)$. Repeating this $N_B$ times yields $N_B$-many pseudo realizations, $b_N^{(1)},\ldots,b_N^{(N_B)}$. Due to an LLN applied with the "population distribution $F_N$ for the pseudo sample", we get

$$\frac{1}{N_B}\sum_{j=1}^{N_B} 1\big[V_N^{(j)-1/2}\sqrt N\,(b_N^{(j)} - b_N) \le c\big] \ \to\ G_N(c; F_N) \quad\text{as } N_B\to\infty.$$

(p.218) This convergence is 'in probability' or 'a.e.' conditional on the original sample $Z_1,\ldots,Z_N$. Hence there are two phases of approximation in bootstrap: the first is with $N_B\to\infty$ for a given $N$ (as in this display), and the second is with $N\to\infty$ for $G_N(c; F_N) - G_N(c; F) \to_p 0$. Since we can increase $N_B$ as much as we want, we can ignore the first phase of approximation and consider the second phase only. This is the bootstrap consistency that we take as a fact here: quantiles found from the pseudo estimates are consistent for the population quantiles.

A.2.3 Percentile-t and Percentile Methods

Suppose $T_N = V_N^{-1/2}\sqrt N(b_N - \beta)$ is asymptotically pivotal. Using bootstrap quantiles $\xi_{N,\alpha/2}$ and $\xi_{N,1-\alpha/2}$ of $T_N^{(1)},\ldots,T_N^{(N_B)}$, we can construct a $(1-\alpha)100\%$ bootstrap CI for $\beta$:

$$\xi_{N,\alpha/2} < V_N^{-1/2}\sqrt N(b_N - \beta) < \xi_{N,1-\alpha/2} \ \Longleftrightarrow\ \Big(b_N - \xi_{N,1-\alpha/2}\frac{V_N^{1/2}}{\sqrt N},\ \ b_N - \xi_{N,\alpha/2}\frac{V_N^{1/2}}{\sqrt N}\Big) \ \text{for } \beta.$$

This way of constructing a CI with an asymptotically pivotal $T_N$ is called the percentile-t method: 'percentile' because percentiles (i.e., quantiles) are used, and 't' because $T_N$ takes the form of the usual t-value, which is asymptotically pivotal.
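For the mean with the usual t-value as the pivotal statistic, a minimal Python sketch of the percentile-t CI (numpy assumed; the names are ours) is:

```python
import numpy as np

def percentile_t_ci(y, B=999, alpha=0.05, seed=0):
    """Percentile-t bootstrap CI for a population mean (pivotal statistic: the t-value)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    N = len(y)
    b_N, v_N = y.mean(), y.var(ddof=1)
    t_pseudo = np.empty(B)
    for b in range(B):
        ys = y[rng.integers(0, N, size=N)]
        # pseudo t-value: centered at the original estimate b_N, not at beta
        t_pseudo[b] = np.sqrt(N) * (ys.mean() - b_N) / np.sqrt(ys.var(ddof=1))
    xi_lo, xi_hi = np.quantile(t_pseudo, [alpha / 2, 1 - alpha / 2])
    half = np.sqrt(v_N / N)                       # V_N^{1/2} / sqrt(N)
    return (b_N - xi_hi * half, b_N - xi_lo * half)

rng = np.random.default_rng(6)
print(percentile_t_ci(rng.chisquare(3, size=200)))
```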

There is also the percentile method using $b_N$. Define the exact df for $b_N$ as

$$J_N(c; F) \equiv P\{b_N(F) \le c\}.$$

The bootstrap estimator for $J_N(c; F)$ is $N_B^{-1}\sum_{j=1}^{N_B} 1[b_N^{(j)} \le c]$. Denoting the empirical df of $b_N^{(1)},\ldots,b_N^{(N_B)}$ as $K_N$, a $(1-\alpha)100\%$ CI for $\beta$ is

$$\Big\{K_N^{-1}\Big(\frac{\alpha}{2}\Big),\ K_N^{-1}\Big(1-\frac{\alpha}{2}\Big)\Big\}.$$

The difference from the percentile-t method is that quantiles of $b_N^{(1)},\ldots,b_N^{(N_B)}$ are used, not quantiles of $T_N^{(1)},\ldots,T_N^{(N_B)}$. One disadvantage of this CI is that $b_N$ may fall outside the CI (or near one end of the CI). To avoid this problem, sometimes a 'bias-corrected CI' is used, as in the following paragraph.

A two-sided $(1-\alpha)100\%$ bias-corrected CI when the asymptotic distribution is normal is, with $\Phi$ being the $N(0,1)$ df,

$$\Big(K_N^{-1}\big[\Phi\{\zeta_{\alpha/2} + 2\Phi^{-1}(K_N(b_N))\}\big],\ \ K_N^{-1}\big[\Phi\{\zeta_{1-\alpha/2} + 2\Phi^{-1}(K_N(b_N))\}\big]\Big).$$

If $b_N$ is the median among the pseudo estimates so that $K_N(b_N) = 0.5$, then $\Phi^{-1}(K_N(b_N)) = 0$: the bias-corrected CI reduces to the preceding $\{K_N^{-1}(\alpha/2), K_N^{-1}(1-\alpha/2)\}$. If $b_N$ is smaller than the pseudo-estimate median, then $K_N(b_N) < 0.5$ and $\Phi^{-1}(K_N(b_N)) < 0$: the bias-corrected CI shifts to the left so that $b_N$ moves to the center of the CI.

A natural question at this stage is why bootstrap inference might be preferred to the usual asymptotic inference. First, in terms of convenience, as long as the computing power allows, bootstrap is easier to use as it just repeats the same estimation procedure $N_B$ times, which makes bootstrap a "no-brain" method. Second, estimating the asymptotic (p.219) variance may be difficult, which bootstrap avoids. Third, the bootstrap approximation error is equal to or smaller than the asymptotic approximation error; for example,

$$G(c; F_N) - G_N(c; F) = O_p(N^{-1/2}), \quad\text{whereas}\quad G_N(c; F_N) - G_N(c; F) = O_p(N^{-1}).$$

For asymmetric CIs, the smaller-order approximation error holds only for the percentile-t method; for symmetric CIs, it holds for both percentile-t and percentile methods. Whenever possible, use percentile-t bootstrap based on a pivotal statistic.

A.2.4 Nonparametric, Parametric, and Wild Bootstraps

Hypothesis testing can be done with bootstrap CIs (or confidence sets), but sometimes CIs are inappropriate—for example, for various model goodness-of-fit tests. In such cases, the issue of bootstrap testing appears. The key issue in a bootstrap test is how to impose the null hypothesis in generating pseudo samples. Although we have only mentioned sampling from the original sample with replacement so far—this is 'nonparametric/empirical bootstrap'—bootstrap testing brings about a host of other ways to generate pseudo samples, depending on how the null hypothesis is imposed.

To appreciate the importance of imposing H0 on pseudo samples, suppose ‘H0:F is N(0,1)‘. Under the H0, nonparametric bootstrap would yield a pseudo sample consisting of “nearly” N(0,1) random variables, and the test with nonparametric bootstrap would work because the realized TS for the original sample will be similar to the pseudo sample TSs. Now suppose that the H0 is false because the true model is N(5,1). In this case, we want to have the realized TS to be much different from the pseudo TSs so that the bootstrap test rejects. If we do not impose the H0 in generating the pseudo samples, then both the original data and pseudo samples will be similar because they all follow more or less N(5,1), resulting in no rejection. But if we impose ‘H0: F is N(0,1)‘ on the pseudo samples, then the realized TS for the original sample (centered around 5) will differ greatly from the TSs from the pseudo sample (centered around 0), leading to a rejection.

Suppose $H_0: f = f_o(\theta)$; that is, the null model is parametric with an unknown parameter $\theta$. In this case, $\theta$ may be estimated by the MLE $\hat\theta$, and the pseudo data can be generated from $f_o(\hat\theta)$. This is parametric bootstrap, where imposing the $H_0$ on pseudo data is straightforward. For instance, if $H_0: F = \Phi$ in binary response, then (i) $\theta$ in $X'\theta$ can be estimated with probit $\hat\theta$, (ii) a pseudo observation $X_i^*$ can be drawn from the empirical distribution of $X_1,\ldots,X_N$, and (iii) $Y_i^*$ can be generated from the binary distribution with $P(Y_i^* = 1|X_i^*) = \Phi(X_i^{*\prime}\hat\theta)$.

Often we have a null model that is not fully parametric, in which case parametric bootstrap does not work, and this makes imposing the null on pseudo data far from straightforward. For instance, the null model may be just a linear model $Y_i = X_i'\beta + U_i$ without the distribution of $(X, U)$ specified. In this case, one way of imposing the null goes as follows. Step 1: sample $X_i^*$ from the empirical distribution of $X_1,\ldots,X_N$. Step 2: sample a residual $\hat U_i^*$ from the empirical distribution of the residuals $\hat U_i \equiv Y_i - X_i'b_N$, $i=1,\ldots,N$. Step 3: generate $Y_i^* \equiv X_i^{*\prime}b_N + \hat U_i^*$. Repeat this $N$ times to get a pseudo sample of size $N$.

(p.220) In the bootstrap scheme for the linear model, $\hat U^*$ is drawn independently of $X^*$, which is fine if $U ⨿ X$. But if we want to allow for heteroskedasticity, this bootstrap does not work because $\hat U^*$ is generated independently of $X^*$; instead, wild bootstrap is suitable: with $X_i^* = X_i$, generate $Y_i^* = X_i'b_N + V_i\hat U_i$, where $V_i$ takes $\pm 1$ with probability 0.5 each. Since $E(V) = 0$ and $E(V^2) = 1$, we get

$$E(V_i\hat U_i|X_i) = E(V_i|X_i)E(\hat U_i|X_i) = 0 \quad\text{and}\quad E(V_i^2\hat U_i^2|X_i) = E(V_i^2|X_i)E(\hat U_i^2|X_i) \simeq E(U_i^2|X_i),$$

preserving the heteroskedasticity in the pseudo sample.
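A minimal Python sketch of wild bootstrap for the linear model (numpy assumed; the names and the simulated heteroskedastic design are ours) is below; it keeps $X_i$ fixed and multiplies each residual by an independent $\pm 1$ draw.

```python
import numpy as np

def wild_bootstrap_slopes(x, Y, B=999, seed=0):
    """Wild bootstrap for LSE coefficients under heteroskedasticity:
    keep X_i fixed, generate Y*_i = X_i'b_N + V_i*U^_i with V_i = +/-1 w.p. 0.5."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(Y)), x])        # add intercept
    bN = np.linalg.lstsq(X, Y, rcond=None)[0]
    resid = Y - X @ bN                                # U^_i
    boot = np.empty((B, X.shape[1]))
    for b in range(B):
        V = rng.choice([-1.0, 1.0], size=len(Y))     # Rademacher weights: E(V)=0, E(V^2)=1
        Ystar = X @ bN + V * resid
        boot[b] = np.linalg.lstsq(X, Ystar, rcond=None)[0]
    return bN, boot                                   # pseudo estimates for CIs or variances

rng = np.random.default_rng(7)
x = rng.uniform(0, 2, 400)
y = 1.0 + 0.5 * x + np.abs(x) * rng.standard_normal(400)   # heteroskedastic errors
bN, boot = wild_bootstrap_slopes(x, y)
print(bN, boot.std(axis=0))
```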

A.3 Confounder Detection, IVE, and Selection Correction

A.3.1 Coherence Checks

Typically, a causal analysis for effects of D on Y starts with finding an association between them. Then we control for observed variables X to see whether the association still stands firm despite X taken out of the picture; in a sense, we try to “negate” the causal claim using X. For unobserved variables ε, we cannot do the same; instead, we try to prop up the prima facie causal finding by showing that the finding is coherent, that is, consistent with auxiliary findings. Sometimes we have an idea on variables lurking in ε, and sometimes we do not. In either case, there are a number of ways to show coherence.

Suppose that a positive effect has been found initially. One would expect that if the treatment level is increased, say to the double dose, then the effect will become stronger. Likewise, if the treatment is reduced to half the dose, then the effect will become weaker. Furthermore, if the treatment is reversed, a negative effect will occur. Of course, the true relation between D and Y can be highly nonlinear, being negative or positive depending on the level of D. Barring such cases, however, confirming those expectations supports the initial causal finding. If the expectations do not hold up, then the initial causal finding is suspect: it might have been due to some ε. Instead of using an extra treatment group with double/half/reverse treatment, using another response not supposed to be affected by D or another control group supposed to be similar to the original control group also can help detect the presence of ε. Examples for these appear below.

Partial Treatment

Suppose we examine job-training effects on reemployment within a certain number of days (e.g., reemployment within 100 days). The T group gets the job training and the C group does not. Suppose there is also a dropout group (a "D" group receiving only part of the required training). The three groups are ordered in terms of treatment dose: the C group with no training, the D group with partial training, and the T group with the full training. If there is no unobserved confounder $\varepsilon$, what is expected (with $X$ controlled) is

$$C \succ D \succ T \ \text{(bad treatment)} \quad\text{or}\quad T \succ D \succ C \ \text{(good treatment)},$$

(p.221) where "$C \succ D$" means that the reemployment probability is greater for the C group than for the D group.

Suppose that the observed finding is $D \succ C \succ T$. There are many possible scenarios for this. One is that the training is harmful, but smart trainees see this and thus drop out; the D group find jobs sooner than the C group because the D group is smarter, resulting in $D \succ C \succ T$. In this scenario, $\varepsilon$ is how smart the person is. Another scenario is that the training is harmful, but the D group drops out because they found a job thanks to a lower reservation wage, resulting in $D \succ C \succ T$. In this scenario, $\varepsilon$ is the reservation wage.

If one thinks further, many more scenarios will come up, possibly based on different unobserved confounders. It is not far-fetched to say that in observational studies, negating those scenarios one by one to zoom in on one scenario—and thus presenting a coherent story—is the goal. In short, there is a coherence problem in the job-training example with $D \succ C \succ T$, which needs to be explained before declaring the treatment good or bad. Had the D group been ignored, only $C \succ T$ would have been seen, leading to the conclusion that the job training is bad. The partial-treatment group, which is an extra treatment group with an extra dose, casts doubt on the causal finding based only on the full-treatment versus no-treatment groups.

As another example, Ludwig et al. (2001) examined effects of moving into a lower poverty area on crime rates. In observational data, people have some control over where they live, and living in high/low-poverty area has an element of self-selection, which Ludwig et al. avoided using experimental data. Since 1994, 638 families from a high-poverty area in Baltimore were randomly assigned to three groups: the T group relocating into an area with poverty rate under 10%, the D group without constraints on poverty rate for relocation, and the C group. The D group is partially treated, because they could (and some did) move into an area with poverty rate higher than 10%. The outcome variables are juvenile arrest records. A total of 279 teens were arrested 998 times in the pre- and postprogram periods. The crimes were classified into violent crimes (assault, robbery), property crimes, and the other crimes (drug offenses, disorderly conduct).

Part of their Table III for juveniles of ages 11–16 is shown below (some covariates are controlled). The second column shows the mean number of arrests per 100 teens per quarter. The third column shows the treatment effect of the T versus C groups, whereas the fourth column shows the treatment effect of the D versus C groups. The two entries with one asterisk are significant at the 10% level, and the entry with two asterisks is significant at 5%.

Effects of Moving into Low-Poverty Area on Crimes

Crime type         Mean arrests for C    T versus C (SD)    D versus C (SD)
Violent crimes     3.0                   -1.6 (0.8)**       -1.4 (0.8)*
Property crimes    2.0                   1.3 (0.8)          -0.5 (0.8)
Other crimes       3.3                   -0.7 (1.0)         -1.3 (1.0)
All crimes         8.3                   -0.9 (1.8)         -3.1 (1.8)*

(p.222) The treatment indeed seems to lower the crime rate; notice the scope for a property crime increase, because a high-income area presents more opportunity for property crime. But in terms of the all-crimes rate, we have the ranking $C \succ T \succ D$, which is strange, for one would expect $C \succ D \succ T$. One possible scenario is that the areas for the T group may have higher arrest probabilities than the areas for the D group; in other words, crime rates are overestimated in the T group areas relative to the D group areas. In this case, arresting intensity/effort is an unobserved confounder.

Reverse Treatment

If a treatment change from 0 to 1 has a positive effect, then the reverse change from 1 to 0 should have a negative effect, which is another way of being coherent. This can be checked with two similar groups where one group experiences 0 to 1 and the other group 1 to 0; for each group, a before-and-after (BA) design is implemented. Contrast this to difference in differences (DD), where one group experiences 0 to 1 and the other group no change; the reverse-treatment design is better than DD, because the distinction between the two groups is clearer if indeed the treatment has some effect.

It is also possible to try the reverse-treatment design on a single group: the treatment changes from 0 to 1 to see the effect, and is then reversed back to 0 to see the reverse effect. If the treatment is effective, $Y$ will take on levels A, B, and back to A as the treatment changes from 0 to 1 and back to 0. Comparing this one-group three-point design with the preceding two-group two-point design (here, 'point' refers to time points), in the former we do not have to worry about the difference between the two groups, but we do have to be concerned about the time effect because three time points are involved. In the latter, the opposite holds.

For time-series or panel data, suppose we use BA to find a positive effect of D (water fluoridation) on Y (tooth decay) over five years. In the beginning of the period, fluoridation started (treatment changing from 0 to 1) and lasted for five years. Comparing the tooth decay proportions at the beginning and end of the five year period, the proportion has been found to be lowered. But during this period, other things may have changed to affect Y. For instance, healthy lifestyles might have been adopted (lower sugar diet due to enhanced health concern including oral health), and this could have been the actual reason for the lower tooth decay proportion. To refute this possibility, suppose fluoridation stopped (the treatment changing from 1 to 0) and stayed that way for another five years. Suppose tooth decay proportion increased during this second five-year period. If unhealthy lifestyle was adopted during this period, then again this might explain the higher tooth decay proportion, which is unlikely; hence, the reverse treatment corroborates the initial finding. This example is a modified version of actual studies on fluoridation referred to in Gordis (2000, 7–9).

Multiple Responses

There have been claims of beneficial effects of moderate drinking of alcohol—particularly red wine—on heart disease. Since there are potential risks in drinking, it is difficult to do an experiment, and studies on that causal link are observational, where (p.223) people self-select their drinking level. Thun et al. (1997) examined a large data set on older U.S. adults with $N = 490{,}000$. In 1982, the individuals reported on their drinking habits, and 46,000 died during the nine-year follow-up. In the study, drinking habit was measured separately for beer, wine, and spirits; the sum was then recorded as the total number of drinks per day. It was found that moderate drinking reduces death rates from cardiovascular diseases.

Part of their Table 4 for women is

Deaths (SD) per 100,000 and Number of Drinks per Day

Cause of death                0            Less than 1    1             2-3           4 or more
Cirrhosis, alcoholism         5.0 (0.9)    4.3 (0.9)      7.7 (1.9)     10.4 (1.9)    23.9 (4.5)
Cardiovascular diseases       335 (7.8)    230 (7.5)      213 (10.2)    228 (9.8)     251 (16)
Breast cancer                 30.3 (2.1)   33.3 (2.4)     37.6 (4.1)    45.8 (4.2)    29.1 (5.3)
Injuries & external causes    22.7 (1.9)   25.5 (2.2)     17.7 (2.8)    18.9 (2.7)    17.1 (4.0)

Examining the cirrhosis and alcoholism row, the death rate increases as more alcohol is consumed. The death rate for cardiovascular diseases decreases for moderate drinking but increases as the number of drinks goes up. The death rate from breast cancer increases substantially but then it drops for four drinks or more, which casts some doubt on the study. The most problematic is the death rate for injuries and external causes, which is decreasing for one drink or more.

If we do a randomized study, then we would expect that drinkers have more accidents (thus a higher death rate for injuries and external causes), because being drunk makes the person less alert and less careful. Being otherwise suggests that drinkers may be systematically different from nondrinkers. Drinkers may be more careful and attentive to their health and lifestyle, and this may be the real reason for the lower cardiovascular disease death rate. Wine drinkers are sometimes reported to have healthy lifestyle in the United States. This may have to do with the fact that wines are more expensive than beers and better educated people with more money drink wines. That is, better education could be the common factor driving wine drinking and healthy lifestyle in the United States. Looking at the extra response variable (death rate due to injuries and external causes), we can see a possible hidden bias due to the unobserved confounders such as alertness/carefulness and healthy lifestyle due to high income and education.

In the drinking example, the extra response variable is expected to be affected by the treatment in a known direction. There are cases where an extra response variable is not supposed to be affected at all. For example, consider the effect of a lower speed limit on the number of traffic accidents. One unobserved confounder is police patrol intensity: it is possible that the police patrol is intensified to enforce the lower speed limit, which then reduces the number of traffic accidents, whereas the real effect of the lower speed limit per se is nil. In this example, an extra response variable can be crime rate not supposed to be affected by speed limit. If crime rate does not change following the speed limit change, then we can rule out the possibility of the intensified (p.224) patrol efforts affecting traffic accidents. Of course, the best thing would be to find a variable representing police patrol effort and see if it really changed. When this is not done, the next best thing would be to use the extra response variable (crime rate) to detect changes in police patrol intensity.

Multiple Control Groups

Zero is an intriguing number, and no treatment can mean many different things. With drinking as the treatment, it may mean the real nondrinkers, but it may also mean people who used to drink heavily a long time ago and then stopped for health reasons (ex-drinkers). With job training as the treatment, no treatment can mean people who never applied to the program, but it can also mean people who applied but were rejected. As the real nondrinkers differ from the ex-drinkers, the nonapplicants differ from the rejected. In the job-training example, there are two control groups: the nonapplicants and the rejected. Both groups did not receive the treatment, but they can differ in terms of unobserved confounders.

It is possible to detect the presence of unobserved confounders using multiple control groups. Let $C$ denote the nonapplicants and $C_r$ the rejected. Suppose $E(Y|X, C) \ne E(Y|X, C_r)$. This must be due to an unobserved variable $\varepsilon$, raising the suspicion that the T group might also differ from $C$ and $C_r$ in terms of $\varepsilon$. Specifically, to ensure the program's success, the program administrators may have "cherry-picked" applicants with higher values of $\varepsilon$, which can be quality or ability. Then $C_r$ comprises people with low $\varepsilon$. In this example, comparing the C group with the extra control group $C_r$ helps one see the presence of an unobserved confounder.

Card and Krueger (1994, 2000) analyzed the effect of a minimum wage increase on employment. In 1992, New Jersey increased its minimum wage from $4.25 to $5.05 per hour. From New Jersey and eastern Pennsylvania, 410 fast-food restaurants were sampled before and after the minimum wage change (treatment) for DD. That is, PA fast-food restaurants were used as a control group. Not just PA restaurants but also NJ fast-food restaurants with a starting wage higher than $5 were used as another control group, because those NJ restaurants were unlikely to be affected by the treatment.

The next table is part of Table 3 in Card and Krueger (1994) and it shows the average (SD), where ‘FTE (full-time equivalent)’ is the number of full-time workers plus 0.5 times the number of part-time workers, ‘NJ ($4.25)’ is for the NJ restaurants with the pretreatment starting wage $4.25 (affected by the treatment), and ‘NJ($5)’ is for the NJ restaurants with the pretreatment starting wage $5 or above (little affected by the treatment).

DD with Two Control Groups for Minimum Wage Effect on Employment

              NJ       PA       NJ-PA          NJ ($4.25)    NJ ($5)    NJ ($4.25)-NJ ($5)
FTE before    20.44    23.33    -2.89          19.56         22.25      -2.69
FTE after     21.03    21.17    -0.14          20.88         20.21      0.67
Difference    0.59     -2.16    2.75 (1.36)    1.32          -2.04      3.36 (1.48)

(p.225) From the last row of the left half, despite the minimum wage increase, NJ FTE increased whereas PA FTE decreased; the DD estimate is significantly positive, showing no negative effect of the minimum wage increase. From the right half using NJ($5) as a second control group, almost the same finding is seen. The second control group renders a coherent story, behaving similarly to the first control group. In this DD, no covariates were controlled for; Card and Krueger (1994) tried many regression models, only to conclude no evidence of employment reduction due to the minimum wage increase.

A.3.2 IVE and Complier Effect

Two-Stage LSE and IVE

Imagine a health education program where individuals are randomized in ($\delta=1$) or out ($\delta=0$) to be given some education on the health benefits of exercise $D$, and we are interested in the effects of exercise $D$ on health $Y$, not the effects of $\delta$ on $Y$. One concern, however, is that there may be unobserved confounders $\varepsilon$ affecting both $D$ and $Y$, making $D$ an endogenous treatment. For instance, $\varepsilon$ may be laziness, which affects exercise $D$ and health $Y$. In this case, a simple LSE of $Y$ on $(1, D)$ will not work. One solution for this problem is using the 'exogenous variation' in $D$ caused by the randomization dummy $\delta$. Since $\delta$ does not affect $Y$ directly, '$\delta \to D \to Y$' holds; $\delta$ affects $Y$ only indirectly through $D$. An instrumental variable estimator (IVE) can be applied with $\delta$ as an instrumental variable (IV) for $D$.

To formalize the idea, suppose, for mean-zero errors $\varepsilon$ and $U$,

$$D_i = \alpha_0 + \alpha_\delta\delta_i + \varepsilon_i \ \text{ with } \alpha_\delta \ne 0 \quad\text{and}\quad Y_i = \beta_0 + \beta_d D_i + U_i,$$

where $COR(D, U) \ne 0$ due to laziness lurking in $\varepsilon$ and $U$, but $COR(\delta, U) = 0$. The interest is in $\beta_d$, the slope of the endogenous treatment $D$. The assumption $\alpha_\delta \ne 0$ is critical for $\delta$ to give an exogenous variation to $D$; in the foregoing example, the education on the health benefits of exercise should make at least some people exercise. Doing the LSE of $D$ on $(1, \delta)$ to get $(\hat\alpha_0, \hat\alpha_\delta)$ and then doing the LSE of $Y$ on $(1, \hat D)$ where $\hat D \equiv \hat\alpha_0 + \hat\alpha_\delta\delta$, we can estimate $\beta_d$ consistently. This is the well-known two-stage LSE, which equals the IVE below.

Rewrite the $D$ and $Y$ structural form (SF) equations as

$$D_i = G_i'\alpha + \varepsilon_i \quad\text{and}\quad Y_i = W_i'\beta + U_i, \quad\text{where } G_i \equiv (1, \delta_i)',\ W_i \equiv (1, D_i)',\ \alpha \equiv (\alpha_0, \alpha_\delta)',\ \beta \equiv (\beta_0, \beta_d)'.$$

With $E^{-1}(\cdot)$ denoting $\{E(\cdot)\}^{-1}$, the IVE that is consistent for $\beta$ is

$$\Big(\frac{1}{N}\sum_i G_i W_i'\Big)^{-1}\frac{1}{N}\sum_i G_i Y_i \ \to_p\ E^{-1}(GW')E(GY) = E^{-1}(GW')E\{G(W'\beta + U)\} = \beta + E^{-1}(GW')E(GU) = \beta \quad\text{as } E(GU) = 0.$$

Here, $\delta$ is an IV for $D$; we also say that $G$ is an instrument (vector) for $W$. With no covariates $X$, call the IVE the 'simple IVE'.

(p.226) As can be seen in the last equation, an IV $\delta$ has to meet two necessary conditions:

$$COR(\delta, D) \ne 0 \ \text{ so that } E^{-1}(GW') \text{ exists} \quad\text{and}\quad COR(\delta, U) = 0 \ \text{ so that } E(GU) = 0;$$

the former is the 'inclusion restriction' $\alpha_\delta \ne 0$—the education should be effective in inducing exercise—and the latter holds due to the randomization of $\delta$. An additional requirement is that $\delta$ not enter the $Y$ equation directly, which is an 'exclusion restriction'—$\delta$ can influence $Y$ only indirectly through $D$. In short, an IV should meet three conditions: the inclusion restriction, the exclusion restriction, and zero correlation with the model error term.

Substituting the $D$ SF into the $Y$ SF, we get the $Y$ reduced-form (RF) equation:

$$Y = \beta_0 + \beta_d(\alpha_0 + \alpha_\delta\delta + \varepsilon) + U = (\beta_0 + \alpha_0\beta_d) + \alpha_\delta\beta_d\,\delta + (U + \beta_d\varepsilon).$$

This shows that if we are interested only in '$H_0: \beta_d = 0$', then we can test whether the slope of $\delta$ is zero in the LSE of $Y$ on $(1, \delta)$ because $\alpha_\delta \ne 0$—this LSE works because $\delta$ is exogenous. The $Y$ RF shows in fact more: the slope of $\delta$ is

$$\gamma_\delta \equiv \alpha_\delta\beta_d,$$

which is the product of the two effects: $\alpha_\delta$ for $\delta$ on $D$, and $\beta_d$ for $D$ on $Y$. This shows yet another way of finding $\beta_d$: do the LSE of $D$ on $(1, \delta)$ to get $\hat\alpha_\delta$ and of $Y$ on $(1, \delta)$ to get $\hat\gamma_\delta$, and finally take the ratio $\hat\gamma_\delta/\hat\alpha_\delta$ for $\beta_d$.

So far, we introduced three ways of finding $\beta_d$: two-stage LSE, simple IVE, and the ratio $\hat\gamma_\delta/\hat\alpha_\delta$. Although they may look different, they are numerically the same. The fact that two-stage LSE and simple IVE are the same is well known (see, e.g., Lee 2010a), and the equivalence of the simple IVE slope and the ratio $\hat\gamma_\delta/\hat\alpha_\delta$ will be seen shortly under the name 'Wald estimator'.
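The numerical equivalence of the three ways can be checked in a few lines. The following Python sketch (numpy assumed; the simulated design and coefficient values are ours, purely for illustration) computes the simple IVE, the two-stage LSE, and the ratio $\hat\gamma_\delta/\hat\alpha_\delta$ on the same data.

```python
import numpy as np

# delta randomized; D endogenous because "laziness" enters both D and Y (true beta_d = 2)
rng = np.random.default_rng(8)
N = 5000
delta = rng.integers(0, 2, N).astype(float)
lazy = rng.standard_normal(N)                       # unobserved confounder
D = 0.3 + 0.4 * delta - 0.5 * lazy + rng.standard_normal(N)
Y = 1.0 + 2.0 * D - 1.0 * lazy + rng.standard_normal(N)

G = np.column_stack([np.ones(N), delta])            # instrument vector (1, delta)
W = np.column_stack([np.ones(N), D])                # regressors (1, D)

# (i) simple IVE: (sum G_i W_i')^{-1} sum G_i Y_i
b_ive = np.linalg.solve(G.T @ W, G.T @ Y)

# (ii) two-stage LSE: LSE of D on (1, delta), then LSE of Y on (1, D^)
alpha = np.linalg.lstsq(G, D, rcond=None)[0]
Dhat = G @ alpha
b_2sls = np.linalg.lstsq(np.column_stack([np.ones(N), Dhat]), Y, rcond=None)[0]

# (iii) ratio of reduced-form slopes: gamma^_delta / alpha^_delta
gamma = np.linalg.lstsq(G, Y, rcond=None)[0]
print(b_ive[1], b_2sls[1], gamma[1] / alpha[1])     # the three coincide
```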

Suppose covariates $X$ with $COR(X, U) = 0$ appear in the $D$ and $Y$ equations as in

$$D_i = \alpha_0 + \alpha_\delta\delta_i + X_i'\alpha_x + \varepsilon_i \quad\text{and}\quad Y_i = \beta_0 + \beta_d D_i + X_i'\beta_x + U_i.$$

Under $COR(\delta, D|X) \ne 0$ and $COR(\delta, U|X) = 0$, the IVE takes the same form as the simple IVE, but with

$$G_i \equiv (1, \delta_i, X_i')',\quad W_i \equiv (1, D_i, X_i')',\quad \alpha \equiv (\alpha_0, \alpha_\delta, \alpha_x')',\quad \beta = (\beta_0, \beta_d, \beta_x')'.$$

When there are more instruments than necessary so that $G$ has more elements than $W$, a generalized version of IVE or the 'generalized method of moments' (GMM) can be applied. More than enough IVs appear naturally when a conditional moment condition such as $E(U|Z) = 0$ is available, because functions of $Z$ are candidate IVs.

Wald Estimator

In the LSE of $Y$ on $(1, D)$, the slope estimator equals the sample mean difference of the two groups $D = 0, 1$, which was proven in Chapter 1. There is an analogous relation, shown below, between the 'simple IVE with a binary instrument $\delta$ for a binary regressor $D$' and the 'ratio of the group mean differences' $E(Y|\delta=1) - E(Y|\delta=0)$ and $E(D|\delta=1) - E(D|\delta=0)$.

(p.227) Recall the simple IVE whose slope is consistent for

$$\frac{COV(Y, \delta)}{COV(D, \delta)} = \frac{E(Y\delta) - E(Y)E(\delta)}{E(D\delta) - E(D)E(\delta)};$$

if $D = \delta$, this equals the slope parameter $COV(Y,\delta)/V(\delta)$ of the LSE of $Y$ on $(1, \delta)$. Observe

$$E(D\delta) = E(D|\delta=1)P(\delta=1) \quad\text{and}\quad D = D(\delta + 1 - \delta).$$

Rewrite the denominator $E(D\delta) - E(D)E(\delta)$ in the preceding display as

$$E(D|\delta=1)P(\delta=1) - E\{D(\delta + 1 - \delta)\}P(\delta=1)$$
$$= E(D|\delta=1)P(\delta=1) - E(D\delta)P(\delta=1) - E\{D(1-\delta)\}P(\delta=1)$$
$$= E(D|\delta=1)P(\delta=1) - E(D|\delta=1)P(\delta=1)^2 - E(D|\delta=0)P(\delta=0)P(\delta=1)$$
$$= E(D|\delta=1)P(\delta=0)P(\delta=1) - E(D|\delta=0)P(\delta=0)P(\delta=1).$$

Analogously, the numerator $E(Y\delta) - E(Y)E(\delta)$ equals

$$E(Y|\delta=1)P(\delta=0)P(\delta=1) - E(Y|\delta=0)P(\delta=0)P(\delta=1).$$

Canceling $P(\delta=0)P(\delta=1)$, which appears in both the denominator and the numerator, gives

$$\frac{COV(Y,\delta)}{COV(D,\delta)} = \frac{E(Y|\delta=1) - E(Y|\delta=0)}{E(D|\delta=1) - E(D|\delta=0)}.$$

The sample version of the last ratio of group mean differences is the Wald estimator:

$$\Big(\frac{\sum_i Y_i\delta_i}{\sum_i \delta_i} - \frac{\sum_i Y_i(1-\delta_i)}{\sum_i (1-\delta_i)}\Big)\Big(\frac{\sum_i D_i\delta_i}{\sum_i \delta_i} - \frac{\sum_i D_i(1-\delta_i)}{\sum_i (1-\delta_i)}\Big)^{-1}.$$

In the causal route $\delta \to D \to Y$, the numerator of the Wald estimator is for the multiplicative indirect effect $\alpha_\delta\beta_d$ of $\delta$ on $Y$, and the denominator is for the effect $\alpha_\delta$ of $\delta$ on $D$; by the division, the direct effect $\beta_d$ of $D$ on $Y$ is recovered. This is the aforementioned equivalence of the simple IVE to the LSE-based ratio $\hat\gamma_\delta/\hat\alpha_\delta$.

In a clinical trial where $\delta$ is a random assignment and there is "compliance" if $D = \delta$ and "noncompliance" if $D \ne \delta$, $E(Y|\delta=1) - E(Y|\delta=0)$ is called the 'intent-to-treat effect', because it shows the effect of the treatment intention (i.e., assignment), not of the actual treatment received. Noncompliance to treatment dilutes the true effect, and the Wald estimator blows up the diluted effect with the factor $\{E(D|\delta=1) - E(D|\delta=0)\}^{-1}$. This is the 'rescaling' role of the Wald estimator denominator.
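A minimal Python sketch of the Wald estimator (numpy assumed; the simulated one-sided-noncompliance design is ours, purely for illustration) is:

```python
import numpy as np

def wald_estimator(Y, D, delta):
    """Wald (grouping) estimator: ratio of group mean differences in Y and D by delta."""
    Y, D, delta = map(np.asarray, (Y, D, delta))
    num = Y[delta == 1].mean() - Y[delta == 0].mean()   # intent-to-treat effect on Y
    den = D[delta == 1].mean() - D[delta == 0].mean()   # effect of delta on D
    return num / den

# example with compliers and never-takers only (no always-takers, no defiers)
rng = np.random.default_rng(9)
N = 20000
delta = rng.integers(0, 2, N)
complier = rng.random(N) < 0.6
D = delta * complier                                    # treated only if assigned and a complier
Y = 1.0 + 1.5 * D + rng.standard_normal(N)              # effect 1.5 on the treated compliers
print(wald_estimator(Y, D, delta))                      # close to 1.5
```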

So far, a constant treatment effect, the same for all individuals, has been assumed. If the treatment effect is heterogeneous, varying across individuals, then the IVE can be inconsistent. To see this, recall $Y = Y^0 + (Y^1 - Y^0)D$ and suppose that the individual effect $Y^1 - Y^0$ is not a constant but $Y_i^1 - Y_i^0 = \beta_d + V_i$ with $E(V) = 0$. (p.228) Then

$$Y = Y^0 + (\beta_d + V)D = \beta_d D + (Y^0 + VD) = E(Y^0 + VD) + \beta_d D + \{Y^0 + VD - E(Y^0 + VD)\},$$

where $E(Y^0 + VD)$ is the intercept and the term in $\{\cdot\}$ is the error. The trouble is $VD$ in the error term, because the instrument $\delta$ may be related to $VD$ as $COR(\delta, D) \ne 0$; if $V ⨿ (D, \delta)$, then the IVE is consistent because $E(\delta VD) = E(V)E(\delta D) = 0$. Since $V$ is part of the treatment effect $Y^1 - Y^0$, '$V ⨿ (D, \delta)$' would be questionable at best. Despite this problem due to heterogeneous effects, the IVE is still consistent for an interesting parameter, as is shown next.

Wald Estimator for Effect on Compliers

Since $\delta$ affects $D$, we can imagine potential treatments $(D^0, D^1)$ depending on $\delta = 0, 1$, analogously to the potential responses $(Y^0, Y^1)$ depending on $D = 0, 1$. We only observe $(\delta, D, Y)$, although $(\delta, D^0, D^1, Y^0, Y^1)$ is considered. Classify the individuals with $(D^0, D^1)$, following Imbens and Angrist (1994) and Angrist et al. (1996):

$$D^0 = 0,\ D^1 = 0:\ \text{never-takers (of the treatment, no matter what } \delta \text{ is)};$$
$$D^0 = 0,\ D^1 = 1:\ \text{compliers (taking the treatment only when } \delta = 1);$$
$$D^0 = 1,\ D^1 = 0:\ \text{defiers (taking the treatment only when } \delta = 0);$$
$$D^0 = 1,\ D^1 = 1:\ \text{always-takers (no matter what } \delta \text{ is)}.$$

For the exercise ($D$) example, never-takers never exercise regardless of the education $\delta$ on the benefits of exercise; compliers exercise only when educated ($\delta=1$); defiers exercise only when not educated ($\delta=0$); always-takers always exercise regardless of $\delta$. We should observe both $D^0$ and $D^1$ to know the type of a person, but only one of $D^0$ and $D^1$ is observed; hence, the type is unknown. Since the grouping based on $(D^0, D^1)$ is not affected by $\delta$, this is a 'principal stratification' (Frangakis and Rubin 2002). In contrast, the membership of the $D=0$ or $D=1$ group changes as $\delta$ changes, so long as $\delta$ affects $D$. For instance, the compliers belong to $D=0$ (along with the never-takers) when $\delta=0$, but they belong to $D=1$ (along with the always-takers) when $\delta=1$.

Suppose

  (a) $P(D=1|\delta)$ changes when $\delta$ changes;

  (b) $(Y^0, Y^1, D^0, D^1) ⨿ \delta$;

  (c) either $D^0 \le D^1$ or $D^0 \ge D^1$ (monotonicity).

Condition (a) is the inclusion restriction that $\delta$ is in the $D$ equation to affect $D$. Condition (b) amounts to the exclusion restriction that $\delta$ is not in the $Y^0$ and $Y^1$ equations. Condition (c) is a monotonicity assumption.

One example in which the three conditions hold is

$$Y^d = \beta_d + U^d,\ d = 0, 1, \qquad D = 1[\alpha_0 + \alpha_\delta\delta + \varepsilon > 0],\ \alpha_\delta \ne 0, \qquad \delta ⨿ (\varepsilon, U^0, U^1).$$

(p.229) Here, $Y^1 - Y^0 = \beta_1 - \beta_0 + V$ with $V \equiv U^1 - U^0$: the effect varies across individuals. Condition (a) holds due to $\alpha_\delta \ne 0$. Condition (b) holds because $\delta$ is independent of $(\varepsilon, U^0, U^1)$ and

$$(Y^0, Y^1, D^0, D^1) = (\beta_0 + U^0,\ \beta_1 + U^1,\ 1[\alpha_0 + \varepsilon > 0],\ 1[\alpha_0 + \alpha_\delta + \varepsilon > 0]).$$

Condition (c) holds with $\le$ or $\ge$ depending on the sign of $\alpha_\delta$. We can allow $\alpha_\delta$ to vary across individuals (say, $\alpha_{\delta i}$) without disturbing the above conditions, as long as all $\alpha_{\delta i}$'s take the same sign. Without loss of generality, assume $D^0 \le D^1$ to rule out defiers from now on.

Observe

$$E(Y|\delta=1) - E(Y|\delta=0) = E\{DY^1 + (1-D)Y^0|\delta=1\} - E\{DY^1 + (1-D)Y^0|\delta=0\}$$
$$= E\{D^1 Y^1 + (1-D^1)Y^0|\delta=1\} - E\{D^0 Y^1 + (1-D^0)Y^0|\delta=0\} \quad(\text{as } D = D^\delta \text{ given } \delta)$$
$$= E\{D^1 Y^1 + (1-D^1)Y^0\} - E\{D^0 Y^1 + (1-D^0)Y^0\} \quad(\text{due to (b)})$$
$$= E\{(D^1 - D^0)(Y^1 - Y^0)\} = E(Y^1 - Y^0|D^1 - D^0 = 1)\,P(D^1 - D^0 = 1) \quad(\text{no defier implies } P(D^1 - D^0 = -1) = 0).$$

Since $D^1 - D^0 = 1 \Longleftrightarrow D^1 = 1, D^0 = 0$ (complier), dividing the first and last expressions by $P(D^1 - D^0 = 1)$ gives the effect on the compliers

$$E(Y^1 - Y^0|D^1 = 1, D^0 = 0) = \frac{E(Y|\delta=1) - E(Y|\delta=0)}{P(D^1 = 1, D^0 = 0)} = \frac{E(Y|\delta=1) - E(Y|\delta=0)}{E(D|\delta=1) - E(D|\delta=0)},$$

which is the Wald estimator's probability limit; the last equality holds because

$$E(D|\delta=1) - E(D|\delta=0) = P(D=1|\delta=1) - P(D=1|\delta=0) = P(\text{always-taker or complier}) - P(\text{always-taker}) = P(\text{complier}).$$

The effect on compliers is also called the 'local average treatment effect' (LATE) (Imbens and Angrist 1994). The qualifier 'local' refers to the fact that LATE is specific to the instrument in use. Bear in mind that the LATE interpretation of the simple IVE (i.e., the Wald estimator) requires the above three conditions, and that LATE can change as the instrument in use changes. If the IVE changes as the instrument changes, then this is an indication of heterogeneous treatment effects (or some instruments may be invalid). It is natural that the IVE estimates the effect on those who change their behavior as the instrument value changes.

Abadie (2003) showed that the effect on the compliers can be written also as

$$\frac{E(YD|\delta=1) - E(YD|\delta=0)}{E(D|\delta=1) - E(D|\delta=0)} - \frac{E\{Y(1-D)|\delta=1\} - E\{Y(1-D)|\delta=0\}}{E(1-D|\delta=1) - E(1-D|\delta=0)}.$$

(p.230) The proof is similar to that for $E\{(D^1 - D^0)(Y^1 - Y^0)\}$. For the first term, observe

$$E(YD|\delta=1) - E(YD|\delta=0) = E(Y^1 D^1|\delta=1) - E(Y^1 D^0|\delta=0) = E(Y^1 D^1) - E(Y^1 D^0) = E\{Y^1(D^1 - D^0)\} = E(Y^1|D^1 - D^0 = 1)\,P(D^1 - D^0 = 1);$$

$E(Y^1 D^0)$, not $E(Y^0 D^0)$, appears because $D^0 = 1$ means treated. Divide the first and last expressions by $E(D|\delta=1) - E(D|\delta=0)$ to obtain

$$E(Y^1|\text{complier}) = \frac{E(YD|\delta=1) - E(YD|\delta=0)}{E(D|\delta=1) - E(D|\delta=0)}.$$

Analogously, the second term holds due to the following:

$$E\{Y(1-D)|\delta=1\} - E\{Y(1-D)|\delta=0\} = E\{Y^0(1-D^1)|\delta=1\} - E\{Y^0(1-D^0)|\delta=0\} = E\{Y^0(1-D^1)\} - E\{Y^0(1-D^0)\}$$
$$= -E\{Y^0(D^1 - D^0)\} = -E(Y^0|D^1 - D^0 = 1)\,P(D^1 - D^0 = 1);$$
$$E(1-D|\delta=1) - E(1-D|\delta=0) = E(1-D^1) - E(1-D^0) = -P(\text{complier}).$$

A.3.3 Selection Correction Approach

Other than IVE, there are a number of ways to deal with unobserved confounders: sensitivity analysis, bounding method, and selection correction approach. Here we examine only selection correction approach for binary D, eschewing sensitivity analysis and bounding method that are not popular in practice. For the sensitivity analysis, interested readers can refer to Lee (2004), Altonji et al. (2005), Ichino et al. (2008), Lee and Lee (2009), Rosenbaum (2010), Huber (2014), and references therein. As for the bounding method, see Manski (2003), Tamer (2010), Choi and Lee (2012), Nevo and Rosen (2012), Chernozhukov et al. (2013), and references therein.

With $W$ denoting covariates including $X$, suppose

$$D_i = 1[0 < W_i'\alpha + \varepsilon_i], \quad Y_i^d = X_i'\beta_d + U_i^d,\ d = 0, 1, \quad W_i = (C_i', X_i')',$$
$$\varepsilon \sim N(0, \sigma_\varepsilon^2) ⨿ W, \quad E(U^d|W, \varepsilon) = \frac{\sigma_{\varepsilon d}}{\sigma_\varepsilon^2}\varepsilon, \quad \sigma_{\varepsilon d} \equiv E(\varepsilon U^d), \quad \rho_{\varepsilon d} \equiv COR(U^d, \varepsilon), \quad \sigma_d^2 \equiv V(U^d).$$

This model includes the exclusion restriction that $C$ is excluded from the $Y$ equation, which is not necessary, but helpful; see, for example, Lee (2010a). From the model, we obtain

$$\tau \equiv E(Y^1 - Y^0) = E(X)'(\beta_1 - \beta_0), \qquad \tau_d \equiv E(Y^1 - Y^0|D=d) = E(X|D=d)'(\beta_1 - \beta_0) + E(U^1|D=d) - E(U^0|D=d).$$

$\tau$ needs $\beta_1 - \beta_0$, and $\tau_1$ and $\tau_0$ additionally need $E(U^1|D=d) - E(U^0|D=d)$.

(p.231) It holds that

$$E(U^d|W, D=1) = E(U^d|W, \varepsilon > -W'\alpha) = \frac{\sigma_{\varepsilon d}}{\sigma_\varepsilon^2}E(\varepsilon|W, \varepsilon > -W'\alpha) = \frac{\sigma_{\varepsilon d}}{\sigma_\varepsilon}E\Big(\frac{\varepsilon}{\sigma_\varepsilon}\Big|W, \frac{\varepsilon}{\sigma_\varepsilon} > \frac{-W'\alpha}{\sigma_\varepsilon}\Big) = \frac{\sigma_{\varepsilon d}}{\sigma_\varepsilon}\,\frac{\phi(-W'\alpha/\sigma_\varepsilon)}{1 - \Phi(-W'\alpha/\sigma_\varepsilon)} = \rho_{\varepsilon d}\sigma_d\,\frac{\phi(W'\alpha/\sigma_\varepsilon)}{\Phi(W'\alpha/\sigma_\varepsilon)};$$
$$E(U^d|W, D=0) = -\frac{\sigma_{\varepsilon d}}{\sigma_\varepsilon}\,\frac{\phi(W'\alpha/\sigma_\varepsilon)}{\Phi(-W'\alpha/\sigma_\varepsilon)} = -\rho_{\varepsilon d}\sigma_d\,\frac{\phi(W'\alpha/\sigma_\varepsilon)}{\Phi(-W'\alpha/\sigma_\varepsilon)}.$$

From this,

$$\tau_1 = E(X|D=1)'(\beta_1 - \beta_0) + E\Big\{\frac{\phi(W'\alpha/\sigma_\varepsilon)}{\Phi(W'\alpha/\sigma_\varepsilon)}\Big|D=1\Big\}(\rho_{\varepsilon 1}\sigma_1 - \rho_{\varepsilon 0}\sigma_0),$$
$$\tau_0 = E(X|D=0)'(\beta_1 - \beta_0) + E\Big\{\frac{\phi(W'\alpha/\sigma_\varepsilon)}{\Phi(-W'\alpha/\sigma_\varepsilon)}\Big|D=0\Big\}(-\rho_{\varepsilon 1}\sigma_1 + \rho_{\varepsilon 0}\sigma_0).$$

The parameters can be estimated with the 'Heckman two-stage estimator' (Heckman 1979) applied separately to the T and C groups. First, $\alpha/\sigma_\varepsilon$ is estimated by probit $\hat\alpha$, and then

$$\text{LSE of } (1-D)Y \ \text{on}\ \Big\{(1-D)X',\ (1-D)\frac{\phi(W'\hat\alpha)}{\Phi(-W'\hat\alpha)}\Big\} \quad\text{and}\quad \text{LSE of } DY \ \text{on}\ \Big\{DX',\ D\frac{\phi(W'\hat\alpha)}{\Phi(W'\hat\alpha)}\Big\}$$

yield estimates, respectively, for

$$\gamma_0 \equiv (\beta_0',\ -\rho_{\varepsilon 0}\sigma_0)' \quad\text{and}\quad \gamma_1 \equiv (\beta_1',\ \rho_{\varepsilon 1}\sigma_1)'.$$

Let $\hat\gamma_d$ denote the LSE for $\gamma_d$, and let the asymptotic variance of $\sqrt N(\hat\gamma_d - \gamma_d)$ be $C_d$ with $\hat C_d \to_p C_d$. Stack the two estimates and parameters: $\hat\gamma \equiv (\hat\gamma_0', \hat\gamma_1')'$ and $\gamma \equiv (\gamma_0', \gamma_1')'$. Then the asymptotic variance of $\sqrt N(\hat\gamma - \gamma)$ is $C \equiv \mathrm{diag}(C_0, C_1)$ and $\hat C \equiv \mathrm{diag}(\hat C_0, \hat C_1) \to_p C$. The online appendix has a program 'SelCorcWorkOnVisit' for the empirical example below, and the program shows how to obtain $\hat C$ with the first-stage probit error taken into account, drawing on Lee (2010a); if getting $\hat C$ looks hard, use nonparametric bootstrap.

Suppose $X$ has dimension $k\times 1$. Define

$$\underset{k\times(2k+2)}{Q} \equiv (-I_k,\ 0_k,\ I_k,\ 0_k), \quad\text{where } 0_k \text{ is the } k\times 1 \text{ null vector}.$$

An estimator for $\tau = E(X)'(\beta_1 - \beta_0)$ and its asymptotic variance estimator are, respectively,

$$\bar X' Q\hat\gamma \quad\text{and}\quad \frac{1}{N}\bar X' Q\hat C Q'\bar X.$$

(p.232) Let $\bar X_d$ be the sample mean of $X$ for the subsample $D=d$, and

$$\underset{(k+1)\times 1}{\bar Z_d} \equiv \Big[\bar X_d',\ \ \frac{1}{N_d}\sum_{i\in\{D=d\}}\frac{\phi\{(2d-1)W_i'\hat\alpha\}}{\Phi\{(2d-1)W_i'\hat\alpha\}}\Big]'.$$

An estimator for $\tau_d$ and its asymptotic variance estimator are

$$\bar Z_d' Q_d\hat\gamma \quad\text{and}\quad \frac{1}{N}\bar Z_d' Q_d\hat C Q_d'\bar Z_d, \quad\text{where}\quad \underset{(k+1)\times(2k+2)}{Q_d} \equiv \begin{bmatrix} -I_k & 0_k & I_k & 0_k \\ 0_k' & 2d-1 & 0_k' & 2d-1 \end{bmatrix};$$

$2d-1 = 1$ if $d=1$ and $-1$ if $d=0$.
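A minimal Python sketch of this two-stage procedure (assuming numpy, scipy, and statsmodels are available; the function name is ours and is not the book's 'SelCorcWorkOnVisit', and nonparametric bootstrap would replace the analytic $\hat C$) is given below. It assumes $W$ and $X$ are numpy arrays that already contain constant columns, and it follows the $\gamma_0$, $\gamma_1$ parameterization above.

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

def selection_correction_effects(Y, D, X, W):
    """Heckman two-stage estimator applied separately to the T (D=1) and C (D=0) groups,
    then sample analogs of tau, tau_1, tau_0 from the formulas above."""
    # first stage: probit of D on W to get alpha^ (estimating alpha/sigma_eps)
    alpha = sm.Probit(D, W).fit(disp=0).params
    idx = W @ alpha
    lam1 = norm.pdf(idx) / norm.cdf(idx)       # correction regressor for the D=1 group
    lam0 = norm.pdf(idx) / norm.cdf(-idx)      # correction regressor for the D=0 group

    # second stage: LSE within each group with its correction regressor
    Z1 = np.column_stack([X[D == 1], lam1[D == 1]])
    Z0 = np.column_stack([X[D == 0], lam0[D == 0]])
    g1 = np.linalg.lstsq(Z1, Y[D == 1], rcond=None)[0]   # (beta_1', rho_1*sigma_1)
    g0 = np.linalg.lstsq(Z0, Y[D == 0], rcond=None)[0]   # (beta_0', -rho_0*sigma_0)
    b1, b0 = g1[:-1], g0[:-1]

    tau = X.mean(axis=0) @ (b1 - b0)
    tau1 = X[D == 1].mean(axis=0) @ (b1 - b0) + lam1[D == 1].mean() * (g1[-1] + g0[-1])
    tau0 = X[D == 0].mean(axis=0) @ (b1 - b0) - lam0[D == 0].mean() * (g1[-1] + g0[-1])
    return tau, tau1, tau0
```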

For a simple illustration, recall the example of the work ($D$) effect on doctor office visits per year ($Y$) in Chapter 1. Set $W = X$ (no particular variable to exclude from $X$) and

$$X = (1, \text{age}, \text{schooling}, \text{male}, \text{married}, \text{phy}, \text{psy})',$$

where male and married are dummies and phy (psy) is the self-assessed physical (psychological) condition in five categories (the lower the better). The estimated effects (t-values in parentheses) are

$$\text{effect on the population: } \tau = E(Y^1 - Y^0): \ -3.15\ (-0.63),$$
$$\text{effect on the treated: } \tau_1 = E(Y^1 - Y^0|D=1): \ 0.42\ (0.059),$$
$$\text{effect on the untreated: } \tau_0 = E(Y^1 - Y^0|D=0): \ -11.23\ (-3.60).$$

Interestingly, $\tau_0$ is significantly negative: making the nonworkers work would reduce doctor office visits by about 11 per year. There are 31% nonworkers in the data.

A.4 Supplements for DD Chapter

This section supplements the DD chapter by presenting various nonparametric DD estimators and discussing ‘one-shot’ treatment. These were omitted in the main text to keep the DD chapter within a reasonable limit; for the same reason, we also put the clustered data issues in the TD chapter. Also, a generalization of DD, called ‘change in changes’, is reviewed near the end of this section.

(p.233) A.4.1 Nonparametric Estimators for Repeated Cross-Section DD

Recall the covariates $W$, the treatment-qualification dummy $Q$, and the sampling dummy $S$ for the post-treatment period. Let

$$\hat\mu_{11}(w) \equiv \frac{\sum_j K\{(W_j - w)/h\}\,Q_j S_j Y_j}{\sum_j K\{(W_j - w)/h\}\,Q_j S_j} \ \to_p\ E(Y|W=w, Q=1, S=1),$$
$$\hat\mu_{10}(w) \equiv \frac{\sum_j K\{(W_j - w)/h\}\,Q_j(1-S_j)Y_j}{\sum_j K\{(W_j - w)/h\}\,Q_j(1-S_j)} \ \to_p\ E(Y|W=w, Q=1, S=0),$$
$$\hat\mu_{01}(w) \equiv \frac{\sum_j K\{(W_j - w)/h\}\,(1-Q_j)S_j Y_j}{\sum_j K\{(W_j - w)/h\}\,(1-Q_j)S_j} \ \to_p\ E(Y|W=w, Q=0, S=1),$$
$$\hat\mu_{00}(w) \equiv \frac{\sum_j K\{(W_j - w)/h\}\,(1-Q_j)(1-S_j)Y_j}{\sum_j K\{(W_j - w)/h\}\,(1-Q_j)(1-S_j)} \ \to_p\ E(Y|W=w, Q=0, S=0).$$

Recall the $W$-conditional effects on the treated, untreated, and population:

$$E(Y_3^1 - Y_3^0|W_3=w, Q=1), \quad E(Y_3^1 - Y_3^0|W_3=w, Q=0), \quad E(Y_3^1 - Y_3^0|W_3=w),$$

where $w$ is to be integrated out using $F_{W|Q=1,S=1} = F_{W_3|Q=1}$, $F_{W|Q=0,S=1} = F_{W_3|Q=0}$, and $F_{W|S=1} = F_{W_3}$ for the respective marginal effects. In view of this, consistent estimators for the effect on the treated, untreated, and population are, respectively,

$$\frac{1}{\#\{Q=1, S=1\}}\sum_{i\in\{Q=1,S=1\}}\big[\hat\mu_{11}(W_i) - \hat\mu_{10}(W_i) - \{\hat\mu_{01}(W_i) - \hat\mu_{00}(W_i)\}\big],$$
$$\frac{1}{\#\{Q=0, S=1\}}\sum_{i\in\{Q=0,S=1\}}\big[\hat\mu_{11}(W_i) - \hat\mu_{10}(W_i) - \{\hat\mu_{01}(W_i) - \hat\mu_{00}(W_i)\}\big],$$
$$\frac{1}{\#\{S=1\}}\sum_{i\in\{S=1\}}\big[\hat\mu_{11}(W_i) - \hat\mu_{10}(W_i) - \{\hat\mu_{01}(W_i) - \hat\mu_{00}(W_i)\}\big],$$

where $\#\{\cdot\}$ denotes the number of members in $\{\cdot\}$.

A.4.2 Nonparametric Estimation for DD with Two-Wave Panel Data

With only two periods (2 and 3 for before and after) in hand, nonparametric estimation can be done with $\Delta Y_3 = Y_3 - Y_2$ as a single response variable and $W_{23} \equiv (C', X_2', X_3')'$ as the covariates; recall $W_{it} \equiv (C_i', X_{it}')'$. The resulting estimators are analogous to the matching and nonmatching estimators in Chapters 2 and 3 for cross-section data. In this sense, this section may be taken as a "review" of nonparametric estimators for treatment effects. We present four estimators that appeared in Chapters 2 and 3: matching, weighting, regression imputation (RI), and complete pairing (CP). With only two waves, the time-constancy of $Q$ does not matter, as $Q_{i3}$ can be taken as $Q_i$. In this subsection, we write

$$(\Delta Y, \Delta Y^0, W, Q) \quad\text{instead of}\quad (\Delta Y_3, \Delta Y_3^0, W_{23}, Q_3).$$

(p.234) Matching Estimators

A matching estimator for the effect on the treated under the DD identification condition $\Delta Y^0 ⨿ Q|W$ is

$$\hat\mu_1 \equiv \frac{1}{N_1}\sum_{t\in T}\Big\{\Delta Y_t - \frac{1}{|C_t|}\sum_{c\in C_t}\Delta Y_c\Big\},$$

where $N_1 = \sum_i Q_i$, $C_t$ is the comparison group for treated $t$ based on a $W$-distance, and $|C_t|$ is the number of controls in $C_t$. Henceforth, for simplification, we pretend that all $\Delta Y_t$'s are used, although some treated individuals may be passed over in practice if no good matched controls are found (i.e., if $|C_t| = 0$). The dimension problem in matching can be avoided with the propensity score $\pi(W) \equiv P(Q=1|W)$ in place of $W$. As for asymptotic inference, nonparametric bootstrap may be used.

A matching estimator for the effect on the untreated under the identification condition $(Y_3^1 - Y_2^0) ⨿ Q|W$ is

$$\hat\mu_0 \equiv \frac{1}{N_0}\sum_{c\in C}\Big\{\frac{1}{|T_c|}\sum_{t\in T_c}\Delta Y_t - \Delta Y_c\Big\},$$

where $T_c$ is the comparison group for control $c$. We may define $\Delta Y^1 \equiv Y_3^1 - Y_2^0$ to rewrite the identification condition as $\Delta Y^1 ⨿ Q|W$, which would then go parallel with $\Delta Y^0 ⨿ Q|W$.

Combining the two estimators, an estimator for the effect on the population under $\Delta Y^0 ⨿ Q|W$ and $(Y_3^1 - Y_2^0) ⨿ Q|W$ is

$$\hat\mu \equiv \frac{N_0}{N}\hat\mu_0 + \frac{N_1}{N}\hat\mu_1 = \frac{1}{N}\sum_{i=1}^N\Big\{\Big(\Delta Y_i - \frac{1}{|C_i|}\sum_{c\in C_i}\Delta Y_c\Big)1[i\in T] + \Big(\frac{1}{|T_i|}\sum_{t\in T_i}\Delta Y_t - \Delta Y_i\Big)1[i\in C]\Big\}.$$

Propensity score matching can also be applied to the effects on the untreated and on the population.

Weighting Estimators

Recall, for cross-section data,

1P(D=1)E{Dπ(X)1π(X)Y}=E(Y1Y0|D=1),
which is the probability limit of the weighting estimator τ^1w in Chapter 3. To do analogously for DD, under ID DD, we just have to use
1P(Q=1)E{Qπ(W)1π(W)ΔY}=E{(Y31Y20)ΔY0|Q=1}=E(Y31Y30|Q=1).
Abadie (2005) used this to propose a number of nonparametric weighting estimators where π() is nonparametrically estimated. An estimator based on the first term of this (p.235) display for the effect on the treated is
$$\hat\mu_{1w} \equiv \frac{N}{N_1}\,\frac{1}{N}\sum_i \frac{Q_i - \hat\pi(W_i)}{1-\hat\pi(W_i)}\,\Delta Y_i = \frac{1}{N_1}\sum_i \frac{Q_i - \hat\pi(W_i)}{1-\hat\pi(W_i)}\,\Delta Y_i.$$

Analogously, the effect on the untreated is

$$\frac{1}{P(Q=0)}\,E\Bigl\{\frac{Q-\pi(W)}{\pi(W)}\,\Delta Y\Bigr\} = E(Y_3^1-Y_3^0\,|\,Q=0),$$
and its sample version is
$$\hat\mu_{0w} \equiv \frac{N}{N_0}\,\frac{1}{N}\sum_i \frac{Q_i - \hat\pi(W_i)}{\hat\pi(W_i)}\,\Delta Y_i = \frac{1}{N_0}\sum_i \frac{Q_i - \hat\pi(W_i)}{\hat\pi(W_i)}\,\Delta Y_i.$$

The effect on the population is the weighted average of $E(Y_3^1-Y_3^0\,|\,Q=1)$ and $E(Y_3^1-Y_3^0\,|\,Q=0)$, an estimator for which is

$$\hat\mu_w \equiv \frac{N_0}{N}\hat\mu_{0w} + \frac{N_1}{N}\hat\mu_{1w}.$$
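A minimal sketch of $\hat\mu_{1w}$, $\hat\mu_{0w}$, and $\hat\mu_w$ follows; the logit specification for $\hat\pi(W)$ via scikit-learn is an illustrative choice made here only for brevity, whereas Abadie (2005) estimates $\pi(\cdot)$ nonparametrically.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dd_weighting(dY, W, Q):
    """dY: first-differenced outcome; W: covariates; Q: 0/1 group dummy."""
    W = np.atleast_2d(W).reshape(len(Q), -1)      # ensure a 2-D design matrix
    pi = LogisticRegression().fit(W, Q).predict_proba(W)[:, 1]  # pi_hat(W_i)
    N1, N0 = Q.sum(), (1 - Q).sum()
    mu1w = np.sum((Q - pi) / (1 - pi) * dY) / N1  # effect on the treated
    mu0w = np.sum((Q - pi) / pi * dY) / N0        # effect on the untreated
    mu_w = (N0 * mu0w + N1 * mu1w) / len(Q)       # effect on the population
    return mu1w, mu0w, mu_w
```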

Regression Imputation Estimators

An RI estimator for the effect on the treated under $\Delta Y^0 \amalg Q \mid W$ is

$$\frac{1}{N_1}\sum_{t\in T}\hat E(\Delta Y\,|\,W_t,\,Q=1) - \frac{1}{N_1}\sum_{t\in T}\hat E(\Delta Y\,|\,W_t,\,Q=0)$$
$$\overset{p}{\longrightarrow} E\{E(\Delta Y|W,Q=1)\,|\,Q=1\} - E\{E(\Delta Y|W,Q=0)\,|\,Q=1\}$$
$$= E\{E(Y_3^1-Y_2^0|W,Q=1)\,|\,Q=1\} - E\{E(\Delta Y^0|W,Q=0)\,|\,Q=1\}$$
$$= E\{E(Y_3^1-Y_2^0|W,Q=1)\,|\,Q=1\} - E\{E(\Delta Y^0|W,Q=1)\,|\,Q=1\}$$
$$= E(Y_3^1-Y_3^0\,|\,Q=1).$$
To be specific about $\hat E(\Delta Y|W_t, Q=d)$, we can use, for a kernel K and a bandwidth h,
$$\frac{\sum_{j\in\{Q=d\}}K\{(W_j-W_t)/h\}\,\Delta Y_j}{\sum_{j\in\{Q=d\}}K\{(W_j-W_t)/h\}}.$$

Analogous to the RI estimator for the effect on the treated are the following estimators for the effect on the untreated under $(Y_3^1-Y_2^0) \amalg Q \mid W$ and on the population under $\Delta Y^0 \amalg Q \mid W$ and $(Y_3^1-Y_2^0) \amalg Q \mid W$:

$$\frac{1}{N_0}\sum_{c\in C}\hat E(\Delta Y\,|\,W_c,\,Q=1) - \frac{1}{N_0}\sum_{c\in C}\hat E(\Delta Y\,|\,W_c,\,Q=0)$$
$$\overset{p}{\longrightarrow} E\{E(\Delta Y|W,Q=1)\,|\,Q=0\} - E\{E(\Delta Y|W,Q=0)\,|\,Q=0\}$$
$$= E\{E(Y_3^1-Y_2^0|W,Q=1)\,|\,Q=0\} - E\{E(\Delta Y^0|W,Q=0)\,|\,Q=0\}$$
$$= E\{E(Y_3^1-Y_2^0|W,Q=0)\,|\,Q=0\} - E\{E(\Delta Y^0|W,Q=0)\,|\,Q=0\}$$
$$= E(Y_3^1-Y_3^0\,|\,Q=0);$$
(p.236)
$$\frac{1}{N}\sum_{i=1}^{N}\hat E(\Delta Y\,|\,W_i,\,Q=1) - \frac{1}{N}\sum_{i=1}^{N}\hat E(\Delta Y\,|\,W_i,\,Q=0)$$
$$\overset{p}{\longrightarrow} E\{E(\Delta Y|W,Q=1)\} - E\{E(\Delta Y|W,Q=0)\}$$
$$= E\{E(Y_3^1-Y_2^0|W)\} - E\{E(\Delta Y^0|W)\} = E(Y_3^1-Y_3^0).$$
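The following sketch implements the three RI estimators with the kernel estimate of $\hat E(\Delta Y|W,Q=d)$ displayed above; a scalar W and a Gaussian kernel are again assumed purely for illustration.

```python
import numpy as np

def ehat(w0, W, dY, Q, d, h):
    """Kernel estimate of E(dY | W = w0, Q = d)."""
    k = np.exp(-0.5 * ((W - w0) / h) ** 2) * (Q == d)
    return np.sum(k * dY) / np.sum(k)

def dd_ri(dY, W, Q, h):
    def avg_effect(idx):
        # average the imputed treated-minus-untreated difference over idx
        return np.mean([ehat(W[i], W, dY, Q, 1, h) -
                        ehat(W[i], W, dY, Q, 0, h) for i in idx])
    treated = avg_effect(np.where(Q == 1)[0])      # effect on the treated
    untreated = avg_effect(np.where(Q == 0)[0])    # effect on the untreated
    population = avg_effect(np.arange(len(Q)))     # effect on the population
    return treated, untreated, population
```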

Complete Pairing Estimator

A CP estimator for DD with continuously distributed W is

$$\widehat{DD}_{23} \equiv \frac{1}{N_0N_1}\sum_{c=1}^{N_0}\sum_{t=1}^{N_1}K\Bigl(\frac{W_t-W_c}{h}\Bigr)(\Delta Y_t-\Delta Y_c)\ \Big/\ \frac{1}{N_0N_1}\sum_{c=1}^{N_0}\sum_{t=1}^{N_1}K\Bigl(\frac{W_t-W_c}{h}\Bigr)$$
$$\overset{p}{\longrightarrow}\ \mu_{23} \equiv \int\{E(\Delta Y|W=w,Q=1) - E(\Delta Y|W=w,Q=0)\}\,\omega_{23}(w)\,\partial w,$$
where $\omega_{23}(w)$ is the product of the density of $W|Q=1$ and the density of $W|Q=0$ evaluated at w.

Suppose $\Delta Y^0 \amalg Q \mid W$ and $(Y_3^1-Y_2^0) \amalg Q \mid W$ (or $(\Delta Y_3^0,\,Y_3^1-Y_2^0) \amalg Q \mid W_{23}$) hold so that we can drop Q in the integrand of $\mu_{23}$, which then becomes

$$E(Y_3^1-Y_2^0\,|\,W=w) - E(\Delta Y_3^0\,|\,W=w) = E(Y_3^1-Y_3^0\,|\,W=w).$$
Hence we will be estimating the effect on the population. CP can also be done with $\pi(W)$ instead of W to avoid the dimension problem; that is, replace W in $\widehat{DD}_{23}$ with $\hat\pi(W) = \hat P(Q=1|W)$, the estimated probit/logit probability. Nonparametric bootstrap, or the CP asymptotic variance ignoring the $\pi(W)$-estimation error, can be used for asymptotic inference.
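A minimal sketch of $\widehat{DD}_{23}$ appears below, once more assuming a scalar W and a Gaussian kernel; replacing W with $\hat\pi(W)$ would work the same way.

```python
import numpy as np

def dd_cp(dY, W, Q, h):
    """Complete-pairing DD: kernel-weighted average of all treated-control
    differences in dY, weights based on the W-distance of each pair."""
    t, c = np.where(Q == 1)[0], np.where(Q == 0)[0]
    K = np.exp(-0.5 * ((W[t][:, None] - W[c][None, :]) / h) ** 2)  # (N1, N0)
    diff = dY[t][:, None] - dY[c][None, :]
    return np.sum(K * diff) / np.sum(K)   # estimates the effect on the population
```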

A.4.3 Panel Linear Model Estimation for DD with One-Shot Treatment

Recall the covariates and treatment-interacting covariates for panel data DD:

$$W_{it} = (C_i,\,X_{it})\quad\text{and}\quad G_{it} = (A_i,\,H_{it}),$$
where $A_i$ consists of elements of $C_i$ and $H_{it}$ consists of elements of $X_{it}$. With one-shot treatment at $t=\tau$, we have $D_{it} = Q_i 1[t=\tau]$ and the model becomes, for $t=0,\ldots,T$,
$$Y_{it}^0 = \beta_t + \beta_q Q_i + \beta_{qc}Q_iC_i + \beta_w W_{it} + V_{it},\qquad V_{it} = \delta_i + U_{it},$$
$$Y_{it}^1 = Y_{it}^0 + \beta_d 1[t=\tau] + \beta_{dg}1[t=\tau]G_{it},$$
$$Y_{it} = \beta_t + \beta_d Q_i 1[t=\tau] + \beta_{dg}Q_i 1[t=\tau]G_{it} + \beta_q Q_i + \beta_{qc}Q_iC_i + \beta_w W_{it} + V_{it}.$$
The treatment effect is $\beta_d + \beta_{dg}G_\tau$ only at period $t=\tau$, and zero at the other periods.

(p.237) Observe

$$\Delta\beta_t = \Delta\beta_1 + (\Delta\beta_2-\Delta\beta_1)1[t=2] + \cdots + (\Delta\beta_T-\Delta\beta_1)1[t=T];$$
$$\Delta\bigl(Q_i1[t=\tau]\bigr) = Q_i1[t=\tau] - Q_i1[t-1=\tau] = Q_i\bigl(1[t=\tau] - 1[t=\tau+1]\bigr);$$
$$\Delta\bigl(\beta_{dg}Q_i1[t=\tau]G_{it}\bigr) = \Delta\bigl(\beta_{dg}Q_i1[t=\tau]G_{i\tau}\bigr) = \beta_{dg}Q_iG_{i\tau}\bigl(1[t=\tau] - 1[t=\tau+1]\bigr).$$
Using these, difference the Yit equation to obtain
$$\Delta Y_{it} = \Delta\beta_1 + (\Delta\beta_2-\Delta\beta_1)1[t=2] + \cdots + (\Delta\beta_T-\Delta\beta_1)1[t=T]$$
$$\qquad + \beta_d Q_i\bigl(1[t=\tau]-1[t=\tau+1]\bigr) + \beta_{dg}Q_iG_{i\tau}\bigl(1[t=\tau]-1[t=\tau+1]\bigr) + \beta_x\Delta X_{it} + \Delta U_{it}.$$

To implement DD with this, LSE can be applied under

$$E(\Delta U_t\,|\,G_\tau,\,\Delta X_t,\,Q) = 0,\qquad t=1,\ldots,T.$$
With no interacting covariates ($\beta_{dg}=0$), the LSE is for $\Delta Y_{it}$ only on
$$1,\quad 1[t=2],\ \ldots,\ 1[t=T],\quad Q_i\bigl(1[t=\tau]-1[t=\tau+1]\bigr),\quad \Delta X_{it};$$
the slope of $Q_i\bigl(1[t=\tau]-1[t=\tau+1]\bigr)$ is the desired effect $\beta_d$.

To implement the LSE with t=0,1,2,3,4,5 and τ=3, observe

$$\begin{bmatrix}\Delta Y_{i1}\\ \Delta Y_{i2}\\ \Delta Y_{i3}\\ \Delta Y_{i4}\\ \Delta Y_{i5}\end{bmatrix} = \begin{bmatrix}1&0&0&0&0&0&0&\Delta X_{i1}\\ 1&1&0&0&0&0&0&\Delta X_{i2}\\ 1&0&1&0&0&Q_i&Q_iG_{i3}&\Delta X_{i3}\\ 1&0&0&1&0&-Q_i&-Q_iG_{i3}&\Delta X_{i4}\\ 1&0&0&0&1&0&0&\Delta X_{i5}\end{bmatrix}\begin{bmatrix}\Delta\beta_1\\ \Delta\beta_2-\Delta\beta_1\\ \Delta\beta_3-\Delta\beta_1\\ \Delta\beta_4-\Delta\beta_1\\ \Delta\beta_5-\Delta\beta_1\\ \beta_d\\ \beta_{dg}\\ \beta_x\end{bmatrix} + \begin{bmatrix}\Delta U_{i1}\\ \Delta U_{i2}\\ \Delta U_{i3}\\ \Delta U_{i4}\\ \Delta U_{i5}\end{bmatrix};$$
the $\Delta Y_{i3}$, $\Delta Y_{i4}$, and $\Delta Y_{i5}$ equations are obtained by setting $t=3,4,5$ in the $\Delta Y_{it}$ equation:
$$\Delta Y_{i3} = \Delta\beta_1 + (\Delta\beta_3-\Delta\beta_1) + \beta_dQ_i + \beta_{dg}Q_iG_{i3} + \beta_x\Delta X_{i3} + \Delta U_{i3},$$
$$\Delta Y_{i4} = \Delta\beta_1 + (\Delta\beta_4-\Delta\beta_1) - \beta_dQ_i - \beta_{dg}Q_iG_{i3} + \beta_x\Delta X_{i4} + \Delta U_{i4},$$
$$\Delta Y_{i5} = \Delta\beta_1 + (\Delta\beta_5-\Delta\beta_1) + \beta_x\Delta X_{i5} + \Delta U_{i5}.$$
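As an illustration of this system, the sketch below builds the differenced regressors for $t=1,\ldots,5$ with $\tau=3$ and runs the LSE with numpy's least-squares routine; the single time-varying covariate X and the choice $G = X$ (so that $G_{i3} = X_{i3}$) are assumptions made here only to keep the example concrete.

```python
import numpy as np

def dd_one_shot_lse(Y, X, Q, tau=3):
    """Y, X: (N, T+1) arrays over t = 0,...,T; Q: (N,) treatment-group dummy.
    Returns the LSE of (beta_d, beta_dg, beta_x) from the differenced model."""
    N, T1 = Y.shape                 # T1 = T + 1 periods
    dY, dX = np.diff(Y, axis=1), np.diff(X, axis=1)   # columns are t = 1,...,T
    G_tau = X[:, tau]               # interacting covariate evaluated at t = tau
    rows, ys = [], []
    for i in range(N):
        for t in range(1, T1):
            period = np.zeros(T1 - 2)          # dummies 1[t=2],...,1[t=T]
            if t >= 2:
                period[t - 2] = 1.0
            d = Q[i] * (float(t == tau) - float(t == tau + 1))
            rows.append(np.r_[1.0, period, d, d * G_tau[i], dX[i, t - 1]])
            ys.append(dY[i, t - 1])
    Z, y = np.array(rows), np.array(ys)
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    beta_d, beta_dg, beta_x = coef[T1 - 1], coef[T1], coef[T1 + 1]
    return beta_d, beta_dg, beta_x
```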

(p.238) A.4.4 Change in Changes

For repeated cross-sections, Athey and Imbens (2006) proposed a DD estimator without assuming additivity of the effects. In this section, we review their approach, which is hard to use in practice, not least because of the difficulty in allowing for covariates.

Athey and Imbens assumed that the untreated response $Y^0$ is determined by

$$Y_i^0 = h(U_i, S_i)$$
for an unknown function $h(u,s)$ strictly increasing in $u$; recall that $S=1$ is the dummy for being sampled in the post-treatment period. In this model, $Y^0$ is generated in the same way across the treatment and control groups because the group dummy Q does not enter h directly, although Q may enter indirectly; for example, $U = (1-Q)U_0 + QU_1$ for group-0 and group-1 error terms $U_0$ and $U_1$.

Further assume

$$(i)\colon\ U \amalg S \mid Q;\qquad (ii)\colon\ \text{the support of } U_1 \text{ is a subset of the support of } U_0.$$
Assumption (i) is a weaker version of S being independent of the other random variables, and (ii) is needed to construct the desired counterfactual $Y^0|(Q=1,S=1)$ using the identified distributions of the control group's before and after responses, $Y^0|(Q=0,S=0)$ and $Y^0|(Q=0,S=1)$. To simplify the exposition, assume that U is continuously distributed.

The main challenge in any cross-section DD is constructing the counterfactual $Y^0|(Q=1,S=1)$. If we knew $h(\cdot,\cdot)$ in $Y^0 = h(U,S)$, then we would be able to find $Y^0|(Q=1,S=1)$ by using the inverse function $h^{-1}(\cdot\,;S)$ of $h(U,S)$ for a given S. Specifically, first, find U using

$$U = h^{-1}(Y^0;\,0)\qquad\text{for } S=0.$$
Second, with this U, the desired counterfactual $Y^0|(Q=1,S=1)$ would be
$$h(U,1) = h\{h^{-1}(Y^0;\,0),\,1\}.$$
Although h is not observed and this scenario thus does not work as such, it is enough to know the distribution of $Y^0|(Q=1,S=1)$ to obtain its mean as follows; the individual $Y_i^0|(Q_i=1,S_i=1)$ for each i is not needed.

With $F_{Y^0,jk}$ denoting the distribution function of $Y^0|(Q=j,S=k)$, the key equation in Athey and Imbens (2006) that gives the desired counterfactual distribution is

$$F_{Y^0,11}(y) = F_{Y,10}\bigl[F_{Y,00}^{-1}\{F_{Y,01}(y)\}\bigr];$$
Y on the right-hand side equals $Y^0$ because the $(Q,S)=(1,0)$, $(0,0)$, and $(0,1)$ groups are all untreated. Using this, the desired counterfactual mean $E(Y^0|Q=1,S=1)$ is (p.239)
$$\int y\,\partial F_{Y^0,11}(y) = \int y\,\partial F_{Y,10}\bigl[F_{Y,00}^{-1}\{F_{Y,01}(y)\}\bigr] = \int F_{Y,01}^{-1}\{F_{Y,00}(y)\}\,\partial F_{Y,10}(y),$$
$$\text{setting } y = F_{Y,01}^{-1}\{F_{Y,00}(y')\} \Longleftrightarrow F_{Y,00}^{-1}\{F_{Y,01}(y)\} = y'.$$

Using the last expression $\int F_{Y,01}^{-1}\{F_{Y,00}(y)\}\,\partial F_{Y,10}(y)$ for $E(Y^0|Q=1,S=1)$, Athey and Imbens proposed "change in changes (CC)":

$$E(Y\,|\,Q=1,S=1) - E\bigl[F_{Y,01}^{-1}\{F_{Y,00}(Y)\}\,\big|\,Q=1,S=0\bigr]$$
$$= E(Y^1\,|\,Q=1,S=1) - E\bigl[F_{Y,01}^{-1}\{F_{Y,00}(Y^0)\}\,\big|\,Q=1,S=0\bigr]$$
$$= E(Y^1\,|\,Q=1,S=1) - E(Y^0\,|\,Q=1,S=1) = E(Y^1-Y^0\,|\,Q=1,S=1),$$
which is the effect on the treated in the post-treatment period.

A sample version of CC is

$$\frac{1}{\#\{Q=1,S=1\}}\sum_{i\in\{Q=1,S=1\}}Y_i \;-\; \frac{1}{\#\{Q=1,S=0\}}\sum_{i\in\{Q=1,S=0\}}\hat F_{Y,01}^{-1}\{\hat F_{Y,00}(Y_i)\},$$
where $\hat F_{Y,jk}$ is the empirical distribution function of $Y|(Q=j,S=k)$. Covariates can be allowed for by conditioning the above means and distributions on X (i.e., using only the observations sharing the same value of X in the sample version), but this is hardly practical.
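A minimal sketch of this sample version follows; the empirical distribution function is computed by counting, and numpy's quantile routine (with its default interpolation) is used here only as an approximation to the empirical quantile function $\hat F_{Y,01}^{-1}$.

```python
import numpy as np

def cc_effect(Y, Q, S):
    """Sample version of CC: treated post-period mean minus the mean of the
    transformed treated pre-period outcomes F_hat_{Y,01}^{-1}{F_hat_{Y,00}(Y_i)}."""
    y00 = np.sort(Y[(Q == 0) & (S == 0)])   # control group, pre-period
    y01 = np.sort(Y[(Q == 0) & (S == 1)])   # control group, post-period
    y10 = Y[(Q == 1) & (S == 0)]            # treatment group, pre-period
    y11 = Y[(Q == 1) & (S == 1)]            # treatment group, post-period
    # empirical distribution function F_hat_{Y,00} evaluated at each y10
    p = np.searchsorted(y00, y10, side='right') / len(y00)
    # empirical quantile function F_hat_{Y,01}^{-1} evaluated at those p's
    counterfactual = np.quantile(y01, np.clip(p, 0.0, 1.0))
    return y11.mean() - counterfactual.mean()
```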

A simple, intuitive way to understand CC is through the "quantile time effect." Recall that the main task in DD is constructing the counterfactual untreated response with only the time effect for the treatment group. More specifically, the task is to find the time effect using only the control group first, and then add it to the treatment group's pretreatment response, under the identification condition that the time effect is the same across the two groups.

To see how the time effect can be extracted from the control group, consider the "p-quantile time effect" for the control group:

$$F_{Y,01}^{-1}(p) - F_{Y,00}^{-1}(p) = F_{Y,01}^{-1}\{F_{Y,00}(y)\} - F_{Y,00}^{-1}\{F_{Y,00}(y)\}\quad\bigl(\text{with the } p\text{-quantile } y \text{ of } F_{Y,00}\text{: } p = F_{Y,00}(y)\bigr)$$
$$= F_{Y,01}^{-1}\{F_{Y,00}(y)\} - y.$$
Without the time effect, this equation becomes $y - y = 0$ because $F_{Y,01} = F_{Y,00}$. That is, the function
$$y \;\longmapsto\; F_{Y,01}^{-1}\{F_{Y,00}(y)\}$$
gives the time-effect-augmented untreated response. Hence, putting $Y|(Q=1,S=0)$ into this function gives the time-effect-augmented version $Y^0|(Q=1,S=1)$, as was seen in CC:
$$E\bigl[F_{Y,01}^{-1}\{F_{Y,00}(Y)\}\,\big|\,Q=1,S=0\bigr].$$
(p.240)