## Myoung-jae Lee

Print publication date: 2016

Print ISBN-13: 9780190258733

Published to British Academy Scholarship Online: August 2016

DOI: 10.1093/acprof:oso/9780190258733.001.0001


# A Appendix

Source:
Matching, Regression Discontinuity, Difference in Differences, and Beyond
Publisher:
Oxford University Press

This appendix covers prerequisite and supplementary topics, ordered as they are needed when the main-text chapters are read sequentially. First, kernel nonparametric estimation is reviewed; this background is helpful for matching and RD. Second, bootstrap is introduced, which is useful for most estimators, including matching estimators. Third, unobserved confounders and treatment endogeneity are discussed, which are needed mostly for RD; although these topics sit in the appendix because they do not fit the main theme of this book, their importance cannot be overstated. Fourth, “left-over” topics for DD are discussed.

# A.1 Kernel Density and Regression Estimators

## A.1.1 Histogram-Type Density Estimator

Suppose $X_1,\ldots,X_N$ are iid random variables with df $F(x)$ and density function $f(x)$ that is bounded and continuously differentiable. Consider a $B(N,\pi)$ random variable (binomial with $N$ trials and success probability $\pi$), where $h>0$ is a bandwidth:

$$\sum_{i=1}^{N} 1[\,|X_i - x| < h\,] \sim B(N,\pi), \qquad \pi \equiv P(|X - x| < h).$$

It holds that

$$E\Big\{\sum_{i=1}^{N} 1[\,|X_i - x| < h\,]\Big\} = N\pi, \qquad V\Big\{\sum_{i=1}^{N} 1[\,|X_i - x| < h\,]\Big\} = N\pi(1-\pi).$$

A histogram-type density estimator for $f(x)$ with interval size $2h$ is

$$\bar f(x) \equiv \frac{1}{2hN}\sum_{i=1}^{N} 1[\,|X_i - x| < h\,].$$

Without $2h$ in the denominator, $\bar f(x)$ would be a histogram showing the proportion of observations falling in $x \pm h$. Since $\bar f(x)$ estimates $f(x)$ without parameterizing $f(x)$ (as normal or logistic, say), $\bar f(x)$ is a ‘nonparametric estimator’ for $f(x)$. For instance, if we know $X \sim N(\mu,\sigma^2)$ with unknown parameters $\mu$ and $\sigma^2$, then $f(x)$ can be estimated parametrically with

$$\frac{1}{\sqrt{2\pi\hat\sigma^2}}\exp\Big\{-\frac{(x-\hat\mu)^2}{2\hat\sigma^2}\Big\}, \qquad \hat\mu \equiv \frac{1}{N}\sum_{i=1}^{N}X_i,\quad \hat\sigma^2 \equiv \frac{1}{N}\sum_{i=1}^{N}(X_i - \hat\mu)^2.$$
In the following, we show $\bar f(x) \to^p f(x)$; see, for example, Lee (2010a) for more on nonparametric density and regression estimation. Be mindful of the distinction between the data $X_1,\ldots,X_N$ and the evaluation point $x$, which does not have to equal any of $X_1,\ldots,X_N$.

Using the mean and variance of $\sum_i 1[\,|X_i - x| < h\,]$, we get

$$E\{\bar f(x)\} = \frac{N\pi}{2hN} = \frac{\pi}{2h}, \qquad V\{\bar f(x)\} = \frac{N\pi(1-\pi)}{(2hN)^2} = \frac{\pi(1-\pi)}{4h^2N}.$$

With $\pi = F(x+h) - F(x-h)$, Taylor expansion yields, for some $x^*$ and $O(h^2)$ essentially denoting a constant times $h^2$,

$$\pi = F(x+h) - F(x-h) = 2hf(x) + O(h^2) \;\Longrightarrow\; \frac{\pi}{2h} = f(x) + O(h).$$

Substitute $\pi/(2h) = f(x) + O(h)$ into the above $E\{\bar f(x)\}$ and $V\{\bar f(x)\}$ to obtain

$$E\{\bar f(x)\} = f(x) + O(h), \qquad V\{\bar f(x)\} = \frac{\{f(x)+O(h)\}(1-\pi)}{2hN} = O\Big(\frac{1}{Nh}\Big).$$

Let $h \to 0$ and $Nh \to \infty$ as $N \to \infty$. Then, as $N \to \infty$,

$$E\{\bar f(x)\} \to f(x) \quad \text{and} \quad V\{\bar f(x)\} \to 0.$$

$h \to 0$ makes the bias $E\{\bar f(x)\} - f(x)$ disappear, whereas $Nh \to \infty$ makes the variance disappear. This implies $\bar f(x) \to^p f(x)$, because $\bar f(x) \to^p E\{\bar f(x)\}$ due to an LLN and $E\{\bar f(x)\} \to f(x)$.
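To make the estimator concrete, here is a minimal Python sketch (the function name and simulation are ours, not from the book) that evaluates $\bar f(x)$ and checks the consistency argument on simulated $N(0,1)$ data, where the true density at $0$ is $(2\pi)^{-1/2} \approx 0.399$:

```python
import numpy as np

def hist_density(x, data, h):
    """Histogram-type estimator f_bar(x): the proportion of observations
    within h of x, divided by the interval length 2h."""
    data = np.asarray(data)
    return np.mean(np.abs(data - x) < h) / (2.0 * h)

# With h small and N*h large, f_bar(0) should be near phi(0) ~ 0.399.
rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)
est = hist_density(0.0, X, h=0.05)
print(est)  # close to 0.399
```

Shrinking $h$ with $N$ fixed makes the estimate noisier (fewer observations per interval), illustrating the bias-variance trade-off governed by $h \to 0$, $Nh \to \infty$.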

## A.1.2 Kernel Density Estimator

The role of the indicator function $1[\,|X_i - x|/h < 1\,]$ in $\bar f(x)$ is to weight the $i$th observation: the weight is $1$ if $X_i$ falls within the $h$-distance of $x$, and $0$ otherwise. Generalizing this weighting idea, we can use a smooth weight depending on $|X_i - x|$. Now let $X$ be a $k \times 1$ vector; in this case, the $h$-distance becomes an $h^k$-distance; for example, the two-dimensional analog of the interval $(x-h, x+h)$ is the rectangle around $x$ of size $(2h)^2 = 4h^2$.

A ‘kernel density estimator’ is based on the smooth weighting idea:

$$\hat f(x) \equiv \frac{1}{Nh^k}\sum_{i=1}^{N} K\Big(\frac{X_i - x}{h}\Big),$$

where $K$ (called a ‘kernel’) is a smooth multivariate density that is symmetric about $0$ (e.g., the $N(0,I_k)$ density). The kernel estimator $\hat f(x)$ includes $\bar f(x)$ as a special case when the ‘uniform kernel’ $K(z) = 1[\,|z|<1\,]/2$ is used with $k=1$.

Analogous to the earlier proof of $\bar f(x) \to^p f(x)$, we can show that $\hat f(x) \to^p f(x)$. Furthermore, under some regularity conditions,

$$(Nh^k)^{1/2}\{\hat f(x) - f(x)\} \rightsquigarrow N\Big(0,\; f(x)\int K(z)^2 \partial z\Big),$$

which can be used to construct (point-wise) confidence intervals (CIs) for $f(x)$. Although $\int K(z)\partial z = 1$, $\int K(z)^2\partial z \neq 1$ in general: we have to find $\int K(z)^2\partial z$ for CIs. One simple way to find $\int K(z)^2\partial z$ is a ‘Monte Carlo integral’: generating $Z_1,\ldots,Z_n$ iid $N(0,1)$ with density $\phi$, it holds that

$$\frac{1}{n}\sum_{j=1}^{n}\frac{K(Z_j)^2}{\phi(Z_j)} \;\to^p\; E\Big\{\frac{K(Z)^2}{\phi(Z)}\Big\} = \int \frac{K(z)^2}{\phi(z)}\,\phi(z)\,\partial z = \int K(z)^2\partial z;$$

the pseudo-sample average (i.e., the first term) can be used for $\int K(z)^2\partial z$.
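The Monte Carlo integral takes a few lines to implement. Below is a sketch (function names are ours) verified against a case known in closed form: for the uniform kernel $K(z)=1[|z|<1]/2$, $\int K(z)^2\partial z = \int_{-1}^{1}\tfrac14\,\partial z = 0.5$, and for $K=\phi$, $\int\phi^2 = 1/(2\sqrt\pi)\approx 0.282$:

```python
import numpy as np

def phi(z):
    """N(0,1) density."""
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def mc_int_K2(K, n=200_000, seed=0):
    """Monte Carlo integral of K(z)^2 dz: draw Z ~ N(0,1) and average
    K(Z)^2 / phi(Z), whose expectation equals the desired integral."""
    Z = np.random.default_rng(seed).standard_normal(n)
    return np.mean(K(Z) ** 2 / phi(Z))

uniform = lambda z: 0.5 * (np.abs(z) < 1)   # uniform kernel on (-1, 1)
approx = mc_int_K2(uniform)
print(approx)  # close to 0.5
```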

Other than the uniform kernel and the $N(0,1)$ kernel $\phi(\cdot) = (2\pi)^{-1/2}\exp\{-(\cdot)^2/2\}$, popular kernels are

$$\frac{3}{4}(1-z^2)\,1[\,|z|<1\,] \;\;\text{(Epanechnikov)} \qquad \text{and} \qquad \frac{15}{16}(1-z^2)^2\,1[\,|z|<1\,] \;\;\text{(biweight)}.$$

For $k>1$, ‘product kernels’ consisting of copies of a univariate kernel can be used; for example, $K(z) = \prod_{j=1}^{k}\phi(z_j)$.

The scalar $h$ is a ‘bandwidth’ or ‘smoothing parameter’, whose role is analogous to that of the interval size in a histogram. If $h$ is too small, there is no grouping (averaging), and $\hat f(x)$ will be too jagged as $x$ varies (a small bias but a large variance). If $h$ is too large, $\hat f(x)$ will show little variation (a large bias but a small variance). As with the histogram interval size, there is no single best rule for choosing $h$ in practice. When $k=1,2$, the best strategy is visual inspection: choose $h$ such that the graph $x \longmapsto \hat f(x)$ is neither too jagged nor too smooth; if anything, slightly undersmooth.

A practical rule of thumb for choosing $h$ is $h \simeq \nu \cdot N^{-1/(k+4)}$ with, say, $0.5 \le \nu \le 3$ if $k$ is $1$ or $2$, with all components of $X$ standardized. For example, if $K(z) = \prod_{j=1}^{k}\phi(z_j)$ with $z = (z_1,\ldots,z_k)'$ is used, then

$$\hat f(x) = \frac{1}{Nh^k}\sum_{i=1}^{N}\prod_{j=1}^{k}\phi\Big(\frac{X_{ij} - x_j}{h}\Big).$$
More discussion on choosing $K$ and $h$ appears below.

## A.1.3 Kernel Regression Estimator

A kernel regression estimator $\hat\rho(x)$ for $\rho(x) \equiv E(Y|X=x)$ in

$$Y_i = \rho(X_i) + U_i, \qquad E(U|X) = 0,$$

is

$$\hat\rho(x) \equiv \frac{(Nh^k)^{-1}\sum_{i=1}^{N}K((X_i - x)/h)\,Y_i}{(Nh^k)^{-1}\sum_{i=1}^{N}K((X_i - x)/h)} = \frac{\hat g(x)}{\hat f(x)},$$

where the numerator of $\hat\rho(x)$ is defined as $\hat g(x)$. Rewrite $\hat\rho(x)$ as

$$\hat\rho(x) = \sum_{i=1}^{N}\frac{K((X_i - x)/h)}{\sum_{j=1}^{N}K((X_j - x)/h)}\,Y_i$$

to see that $\hat\rho(x)$ is a weighted average of the $Y_i$’s, where the weight is large if $X_i$ is close to $x$ and small otherwise.
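The weighted-average form translates directly into code. The following sketch (our own simulation, with $K = \phi$) estimates $E(Y|X=0.5)$ for $Y = \sin(X) + U$ and should land near $\sin(0.5) \approx 0.479$:

```python
import numpy as np

def phi(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def kernel_reg(x, X, Y, h):
    """Kernel regression rho_hat(x): a weighted average of the Y_i
    with weights K((X_i - x)/h), here using the N(0,1) kernel."""
    w = phi((np.asarray(X) - x) / h)
    return np.sum(w * np.asarray(Y)) / np.sum(w)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, 5000)
Y = np.sin(X) + 0.3 * rng.standard_normal(5000)
est = kernel_reg(0.5, X, Y, h=0.15)
print(est)  # near sin(0.5) ~ 0.479
```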

Similarly to $\hat f(x) \to^p f(x)$, it can be shown that

$$\hat g(x) \to^p g(x) \equiv \rho(x)f(x),$$

which implies, when combined with $\hat f(x) \to^p f(x)$,

$$\hat\rho(x) = \frac{\hat g(x)}{\hat f(x)} \to^p \frac{g(x)}{f(x)} = \rho(x).$$

Analogously to the asymptotic normality of $(Nh^k)^{1/2}\{\hat f(x) - f(x)\}$, it holds under some regularity conditions that

$$(Nh^k)^{1/2}\{\hat\rho(x) - \rho(x)\} \rightsquigarrow N\Big(0,\; \frac{V(U|x)}{f(x)}\int K(z)^2\partial z\Big).$$

$V(U|x) = E(U^2|x)$ can be estimated using the residual $\hat U_i \equiv Y_i - \hat\rho(X_i)$:

$$\frac{\sum_{i=1}^{N}K((X_i - x)/h)\,\hat U_i^2}{\sum_{i=1}^{N}K((X_i - x)/h)} \;\to^p\; E(U^2|x).$$

To implement kernel estimation, one has to choose $K$ and $h$. As for $K$, it is known that the choice of kernel makes little difference. The story is quite different for $h$, however, because the choice of $h$ makes a huge difference. When $k=1$ or $2$, other than the above rule of thumb, a good practical method is drawing $\hat f(x)$ or $\hat\rho(x)$ over a reasonable range of $x$ and choosing $h$ such that the curve estimate is neither too smooth (as when $h$ is too big) nor too jagged (as when $h$ is too small), as in choosing $h$ for density estimation. If $k>2$, this balancing act is hard to do, and it is advisable to choose $h$ by the following scheme.

Consider minimizing for $h$:

$$\frac{1}{N}\sum_{i=1}^{N}\{Y_i - \hat\rho_{-i}(X_i)\}^2,$$

where $\hat\rho_{-i}(X_i)$ is a ‘leave-one-out estimator’ for $\rho(X_i)$ that uses all observations except the $i$th. This method of choosing $h$ is called ‘cross-validation (CV)’, which works well in practice. For estimating $f(x)$, $Y_i$ is irrelevant, and a CV minimand for $h$ is

$$\frac{1}{N^2h^k}\sum_{i=1}^{N}\sum_{j=1}^{N}K^{(2)}\Big(\frac{X_i - X_j}{h}\Big) \;-\; \frac{2}{N(N-1)h^k}\sum_{i=1}^{N}\sum_{j\neq i}K\Big(\frac{X_i - X_j}{h}\Big),$$

where $K^{(2)}(a) \equiv \int K(a-z)K(z)\partial z$.
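For regression, the leave-one-out CV criterion is easy to vectorize: build the $N \times N$ kernel weight matrix once, zero its diagonal (that is the "leave one out"), and compare candidate bandwidths on a grid. A sketch under our own simulated design (not from the book):

```python
import numpy as np

def phi(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def cv_score(h, X, Y):
    """Leave-one-out CV criterion: mean of {Y_i - rho_hat_{-i}(X_i)}^2,
    where rho_hat_{-i} is the kernel regression omitting observation i."""
    W = phi((X[:, None] - X[None, :]) / h)  # W[i, j] = K((X_j - X_i)/h)
    np.fill_diagonal(W, 0.0)                # drop observation i itself
    pred = W @ Y / W.sum(axis=1)            # leave-one-out predictions
    return np.mean((Y - pred) ** 2)

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, 400)
Y = np.sin(3 * X) + 0.3 * rng.standard_normal(400)
grid = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0]
scores = {h: cv_score(h, X, Y) for h in grid}
h_star = min(scores, key=scores.get)
print(h_star, scores[h_star])
```

With a wiggly truth like $\sin(3x)$, very large bandwidths oversmooth and score badly, so the CV minimizer sits at a moderate $h$.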

## A.1.4 Local Linear Regression

The kernel estimator $\hat\rho(x)$ can be obtained by minimizing the following with respect to (wrt) $a$:

$$\sum_{i=1}^{N}(Y_i - a)^2\,K\Big(\frac{X_i - x}{h}\Big).$$

$\hat\rho(x)$ may thus be viewed as predicting $Y_i$ locally around $x$ using only the intercept $a$. A variation of $\hat\rho(x)$ is obtained using a line (an intercept and a slope) centered at $x$, which is local linear regression (LLR) minimizing

$$\sum_{i=1}^{N}\{Y_i - a - (X_i - x)'b\}^2\,K\Big(\frac{X_i - x}{h}\Big)$$

wrt $a$ and $b$. The intercept estimator $\hat a(x)$ is the LLR estimator for $\rho(x)$, whereas the slope estimator $\hat b(x)$ is the LLR estimator for the derivative $\partial\rho(x)/\partial x$. See Fan (1996) for LLR in general.

To be specific, the LLR estimator for $\rho(x)$ is ($0_{1\times k}$ is the $1 \times k$ null vector)

$$\hat a(x) = (1,\; 0_{1\times k})\,\{Z(x)'W(x)Z(x)\}^{-1}Z(x)'W(x)Y,$$

where $Y \equiv (Y_1,\ldots,Y_N)'$, $W(x) \equiv \mathrm{diag}\{K((X_1-x)/h),\ldots,K((X_N-x)/h)\}$, and

$$Z(x) \equiv \begin{pmatrix} 1 & (X_1 - x)' \\ \vdots & \vdots \\ 1 & (X_N - x)' \end{pmatrix}.$$

Compared with the LLR estimator $\hat a(x)$, the usual kernel estimator $\hat\rho(x)$ may be called the ‘local constant regression (LCR)’ estimator. Relatively speaking, LLR is less biased but more variable than LCR; this is the classic trade-off between bias and variance. The advantage of LLR's smaller bias tends to be visible around boundary points of the support of $X$ and around peaks and troughs of $E(Y|X=x)$.

In Figure A.1 with $N=200$, we generated $Y$ with $Y = X - X^2 + U$ where $U, X \sim N(0,1)$ with $U \amalg X$, and LCR and LLR estimates were obtained with $K=\phi$ (the $N(0,1)$ kernel). In the left panel, we set $h = 0.5 \times SD(X)N^{-1/5}$; the bandwidth $h = 2.5 \times SD(X)N^{-1/5}$ in the right panel is five times greater. The solid lines are the true regression function $E(Y|X) = X - X^2$, whereas the dashed and dotted lines are the LCR and LLR estimates, respectively. In the left panel, LCR and LLR are almost the same, and both are undersmoothed in view of the wiggly parts. In the right panel, both are oversmoothed with the larger bandwidth, but LCR is more biased than LLR, as LCR clearly oversmooths the peak.
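Both estimators can be coded from their minimands. The sketch below is only in the spirit of Figure A.1's design $Y = X - X^2 + U$ (we use a larger $N$ than the figure's 200 and a moderate bandwidth, so the check at the peak $x_0 = 0.5$, where $E(Y|X=x_0) = 0.25$, is stable; function names are ours):

```python
import numpy as np

def phi(z):
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def lcr(x, X, Y, h):
    """Local constant (kernel) regression at x."""
    w = phi((X - x) / h)
    return np.sum(w * Y) / np.sum(w)

def llr(x, X, Y, h):
    """Local linear regression at x: weighted LS of Y on (1, X - x);
    the intercept estimates E(Y|X=x), the slope its derivative."""
    w = phi((X - x) / h)
    Z = np.column_stack([np.ones_like(X), X - x])
    a, b = np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * Y))
    return a

rng = np.random.default_rng(3)
N = 5000
X = rng.standard_normal(N)
Y = X - X**2 + rng.standard_normal(N)
x0 = 0.5                        # the peak of x - x^2; truth is 0.25
print(lcr(x0, X, Y, h=0.1), llr(x0, X, Y, h=0.1))
```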

# A.2 Bootstrap

This section reviews bootstrap, drawing on Lee (2010a) who in turn drew on Hall (1992), Efron and Tibshirani (1993), Shao and Tu (1995), Davison and Hinkley (1997), Horowitz (2001), and Efron (2003). See also Van der Vaart (1998), Lehmann and Romano (2005), and DasGupta (2008). In the main text, we mentioned ‘nonparametric (or empirical) bootstrap’ many times to simplify asymptotic inference. Hence, before embarking on the review of bootstrap in general, we quickly explain nonparametric bootstrap in the following.

Figure A.1 $E(Y|X)$ vs. LCR and LLR Estimates

Given an original sample of size $N$ and an estimate $\hat\alpha_N$ for a parameter $\alpha$: (i) resample from the original sample with replacement to construct a pseudo sample of size $N$; (ii) apply the same estimation procedure to the pseudo sample to get a pseudo estimate $\hat\alpha_N^{(b)}$; (iii) repeat this $B$ times (e.g., $B=500$; the higher the better) to obtain $\hat\alpha_N^{(b)}$, $b=1,\ldots,B$; (iv) use quantiles of each component of the $\hat\alpha_N^{(b)}$'s to construct a confidence interval (CI) for the corresponding component of $\alpha$; for example, the 0.025 and 0.975 quantiles of the second components of $\hat\alpha_N^{(1)},\ldots,\hat\alpha_N^{(B)}$ give a 95% CI for the second component of $\alpha$.
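Steps (i)-(iv) can be sketched in a few lines of Python (the function name and simulated data are ours; the book's own illustration is the program ‘BootAvgSim’ mentioned below):

```python
import numpy as np

def boot_percentile_ci(data, stat, B=999, alpha=0.05, seed=0):
    """Nonparametric bootstrap percentile CI: resample with replacement,
    re-apply the estimator, and take quantiles of the pseudo estimates."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    N = len(data)
    pseudo = np.array([stat(data[rng.integers(0, N, N)]) for _ in range(B)])
    return np.quantile(pseudo, alpha / 2), np.quantile(pseudo, 1 - alpha / 2)

rng = np.random.default_rng(4)
sample = rng.exponential(scale=2.0, size=300)   # skewed data, true mean 2.0
lo, hi = boot_percentile_ci(sample, np.mean)
print(lo, hi)  # a 95% CI for the mean
```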

Instead of CIs, sometimes the variance estimator $B^{-1}\sum_{b=1}^{B}(\hat\alpha_N^{(b)} - \hat\alpha_N)(\hat\alpha_N^{(b)} - \hat\alpha_N)'$ is used to estimate the asymptotic variance of $\hat\alpha_N - \alpha$. Although CIs from the bootstrap are consistent as long as the estimation procedure is “smooth,” the consistency of this variance estimator is not known in general.

In the online appendix, the program ‘BootAvgSim’ illustrates how to do nonparametric bootstrap (as well as the ‘bootstrap percentile-t method’ to be explained below) for a mean. The program ‘RegImpPsNprSim’ shows how to implement nonparametric bootstrap in the regression imputation approach, which can be easily modified for bootstrap in other approaches.

## A.2.1 Review of Usual Asymptotic Inference

Statistical inference is conducted with CIs and hypothesis tests (HT). For a $k \times 1$ parameter $\beta$ and an estimator $b_N \to^p \beta$, CI and HT are done using the asymptotic distribution of a transformation of $b_N$: in most cases, for some variance $V$,

$$\sqrt{N}(b_N - \beta) \rightsquigarrow N(0, V).$$

The test statistic (TS) $\sqrt{N}V^{-1/2}(b_N - \beta)$ is asymptotically pivotal because its asymptotic distribution is a known distribution, $N(0, I_k)$.

To do inference with a CI, note $\sqrt{N}(t'b_N - t'\beta) \rightsquigarrow N(0, t'Vt)$ for a known $k \times 1$ vector $t$. With $\zeta_\alpha$ denoting the $\alpha$ quantile of $N(0,1)$ and $V_N \to^p V$, as $N \to \infty$,

$$P\Big\{\,t'b_N - \zeta_{1-\alpha/2}\sqrt{\frac{t'V_Nt}{N}} \;\le\; t'\beta \;\le\; t'b_N + \zeta_{1-\alpha/2}\sqrt{\frac{t'V_Nt}{N}}\,\Big\} \;\to\; 1-\alpha.$$

This gives a CI for $t'\beta$; for example, $t = (0,\ldots,0,1)'$ and $\alpha = 0.05$ yield a symmetric asymptotic 95% CI for $\beta_k$. For $H_0: t'\beta = c$ with a specified value $c$ (typically $c=0$), we reject the $H_0$ if $c$ is not captured by the CI. The false rejection probability (i.e., the type I error) is $\alpha$.

Alternatively to using a CI, we can use an asymptotically pivotal TS to conduct an HT: if the realized value of the TS is “extreme” for the known asymptotic distribution under $H_0$, then the $H_0$ is rejected. For instance, under $H_0: t'\beta = c$, we can use

$$\frac{\sqrt{N}(t'b_N - c)}{\sqrt{t'V_Nt}} \;\rightsquigarrow\; N(0,1).$$

For two-sided tests, we choose the critical region $(-\infty, -\zeta_{1-\alpha/2})$ and $(\zeta_{1-\alpha/2}, \infty)$, and reject $H_0$ if the realized value of the TS falls in the critical region (with false rejection probability $\alpha$). A better way might be looking at the p-value

$$P\{\,|N(0,1)| > |\text{realized TS}|\,\} = 2\,\{1 - \Phi(|\text{realized TS}|)\}$$

and rejecting the $H_0$ if the p-value is smaller than $\alpha$. For one-sided tests, this HT scenario requires minor modifications.

Although CI and HT are equivalent (i.e., “dual”) to each other when $\sqrt{N}(b_N - \beta) \rightsquigarrow N(0,V)$ is used, there are many HTs whose corresponding CIs are hard to think of. For instance, $H_0$: the distribution of $Y$ is symmetric about $0$, or $H_0: E(Y^4) = 3\{E(Y^2)\}^2$.

## A.2.2 Bootstrap to Find Quantiles

Define the exact distribution function (df) of a statistic $T_N(F)$:

$$G_N(c; F) \equiv P\{T_N(F) \le c\}, \qquad T_N(F) \equiv \sqrt{N}\,V_N^{-1/2}\{b_N - \beta(F)\},$$

where $F$ is the distribution of the original sample and $V_N$ is a ‘scaling number (matrix)’. Regard $\beta$ as a scalar for simplification. Keep in mind the distinction between a (probability) distribution and its df; a df is just a deterministic function.

We desire $G_N(c;F)$: how $T_N(F)$ behaves with a given sample of size $N$ when the sample is drawn from the true distribution $F$. The last display makes it explicit that the exact, not asymptotic, distribution of $T_N(F)$ depends on the underlying distribution $F$. The usual large-sample inference in the preceding section uses an approximation (the ‘asymptotic df’ of $T_N(F)$) for $G_N(c;F)$:

$$G_\infty(c; F) \equiv \lim_{N\to\infty} G_N(c; F).$$

Often $T_N(F)$ is asymptotically pivotal: $G_\infty(c;F)$ does not depend on $F$; for example, $G_\infty(c;F) = P\{N(0,I_k) \le c\}$. We may then write just $G_\infty(c)$ instead of $G_\infty(c;F)$. In this case, the large-sample approximation $G_\infty(c)$ to $G_N(c;F)$ is done only through one route (“through the subscript”). A “two-route” approximation is shown next.

Suppose $T_N(F)$ is not asymptotically pivotal; for example, $G_\infty(c;F) = \Phi\{c/\sigma(F)\}$, where the parameter of interest is the mean and $\sigma(F)$ is the SD. In this nonpivotal case, the nuisance parameter $\sigma(F)$ should be replaced by an estimator, say $s_N \equiv \sigma(F_N)$. In a case like this with an asymptotically nonpivotal $T_N(F)$, $G_\infty(c;F_N)$ is used as a large-sample approximation for $G_N(c;F)$ due to the estimated nuisance parameter: two routes of approximation are taken between $G_N(c;F)$ and $G_\infty(c;F_N)$, through the subscript $\infty$ and through $F_N$.

Suppose that $G_N(c;F)$ is smooth in $F$ in the sense

$$G_N(c; F_N) - G_N(c; F) \;\to^p\; 0 \quad \text{as } N \to \infty;$$

recall that the empirical distribution $F_N$ gives probability $N^{-1}$ to each observation $Z_i$, $i=1,\ldots,N$. Bootstrap uses $G_N(c;F_N)$ as an approximation to $G_N(c;F)$, where the approximation is done only through $F_N$. This is in contrast to the large-sample approximation $G_\infty(c)$ or $G_\infty(c;F_N)$ to $G_N(c;F)$.

Whether the last display holds depends on the smoothness of $G_N(c;F)$ as a functional of $F$. This also shows that consistent estimators for $F$ other than $F_N$ (e.g., a smoothed version of $F_N$) may be used in place of $F_N$. This is the basic bootstrap idea: replace $F$ with $F_N$ and do the same thing as with $F$. Since the smoothness of $G_N(c;F)$ is the key ingredient for bootstrap, if the “source” $T_N(F)$ is not smooth in $F$, bootstrap either works in a weaker sense (e.g., quantile regression is “one degree” less smooth than LSE, and bootstrap works for quantile regression in a weaker sense than for LSE) or does not work at all. Bear in mind the different versions of $G$ that have appeared so far:

|  | Nonoperational | Operational |
|---|---|---|
| Finite sample | $G_N(c;F)$ (target) | $G_N(c;F_N)$ (bootstrap) |
| Asymptotic | $G_\infty(c;F)$ | $G_\infty(c)$ (pivotal); $G_\infty(c;F_N)$ (nonpivotal) |

Using $G_N(c;F_N)$ means treating the original sample $(Z_1,\ldots,Z_N)$ as the population; that is, the “population distribution” is multinomial with $P(Z=Z_i) = N^{-1}$. Specifically, with $F$ replaced by $F_N$, we have

$$G_N(c; F_N) = P\{T_N(F_N) \le c\},$$

and $\beta(F_N)$ is the parameter for the empirical distribution. For instance, suppose $\beta(F) = E(Z) = \int z\,\partial F(z)$ and the estimator for $\beta$ is the sample mean $b_N = \bar Z$. Considering a pseudo sample $Z_1^*,\ldots,Z_N^*$ drawn from $F_N$ with replacement (some observations in the original sample get drawn multiple times while some never get drawn), we have

$$\beta(F_N) = \int z\,\partial F_N(z) = \frac{1}{N}\sum_{i=1}^{N}Z_i = \bar Z = b_N, \qquad b_N^* = \bar Z^* \equiv \frac{1}{N}\sum_{i=1}^{N}Z_i^*.$$

This example illustrates that bootstrap approximates the distribution of (scaled) $\bar Z - E(Z)$ with that of (scaled) $\bar Z^* - \bar Z$. That is, the relationship of $b_N = \bar Z$ to $\beta = E(Z)$ is inferred from that of $b_N^* = \bar Z^*$ to $b_N = \bar Z$.

$G_N(c;F_N)$ may look hard to get, but it can be estimated as precisely as desired because $F_N$ is known. One pseudo sample of size $N$ gives one realization of $T_N(F_N)$. Repeating this $N_B$ times yields $N_B$-many pseudo realizations, $b_N^{*(1)},\ldots,b_N^{*(N_B)}$. Due to an LLN applied with the “population distribution $F_N$ for the pseudo sample”, we get

$$\frac{1}{N_B}\sum_{j=1}^{N_B} 1[\,T_N^{*(j)} \le c\,] \;\longrightarrow\; G_N(c; F_N) \quad \text{as } N_B \to \infty.$$

This convergence is ‘in probability’ or ‘a.e.’ conditional on the original sample $Z_1,\ldots,Z_N$. Hence there are two phases of approximation in bootstrap: the first is with $N_B \to \infty$ for a given $N$ (as in this display), and the second is with $N \to \infty$ for $G_N(c;F_N) - G_N(c;F) \to^p 0$. Since we can increase $N_B$ as much as we want, we can ignore the first phase of approximation and consider the second phase only. This is the bootstrap consistency that we take as a fact here: quantiles found from the pseudo estimates are consistent for the population quantiles.

## A.2.3 Percentile-t and Percentile Methods

Suppose $T_N = \sqrt{N}\,V_N^{-1/2}(b_N - \beta)$ is asymptotically pivotal. Using bootstrap quantiles $\xi_{N,\alpha/2}$ and $\xi_{N,1-\alpha/2}$ of $T_N^{*(1)},\ldots,T_N^{*(N_B)}$, we can construct a $(1-\alpha)100\%$ bootstrap CI for $\beta$:

$$\Big[\,b_N - \xi_{N,1-\alpha/2}\,\frac{V_N^{1/2}}{\sqrt{N}},\;\; b_N - \xi_{N,\alpha/2}\,\frac{V_N^{1/2}}{\sqrt{N}}\,\Big].$$

This way of constructing a CI with an asymptotically pivotal $T_N$ is called the percentile-t method: ‘percentile’ because percentiles (i.e., quantiles) are used, and ‘t’ because $T_N$ takes the form of the usual $t$-value, which is asymptotically pivotal.
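For the mean, the percentile-t recipe is: bootstrap the studentized statistic, take its quantiles, and invert. A sketch under our own simulated data (function name ours):

```python
import numpy as np

def boot_percentile_t_ci(data, B=999, alpha=0.05, seed=0):
    """Percentile-t bootstrap CI for the mean: bootstrap the studentized
    T* = sqrt(N)(mean* - mean)/sd*, then invert its quantiles."""
    rng = np.random.default_rng(seed)
    z = np.asarray(data)
    N = len(z)
    m, s = z.mean(), z.std(ddof=1)
    T = np.empty(B)
    for b in range(B):
        zb = z[rng.integers(0, N, N)]
        T[b] = np.sqrt(N) * (zb.mean() - m) / zb.std(ddof=1)
    q_lo, q_hi = np.quantile(T, alpha / 2), np.quantile(T, 1 - alpha / 2)
    # Invert the quantiles: note the upper T* quantile sets the lower end.
    return m - q_hi * s / np.sqrt(N), m - q_lo * s / np.sqrt(N)

rng = np.random.default_rng(5)
sample = rng.exponential(2.0, size=200)   # skewed, so the CI is asymmetric
lo, hi = boot_percentile_t_ci(sample)
print(lo, hi)
```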

There is also the percentile method using $b_N$. Define the exact df of $b_N$ as

$$J_N(c; F) \equiv P(b_N \le c).$$

The bootstrap estimator for $J_N(c;F)$ is $N_B^{-1}\sum_{j=1}^{N_B}1[b_N^{*(j)} \le c]$. Denoting the empirical df of $b_N^{*(1)},\ldots,b_N^{*(N_B)}$ as $K_N^*$, a $(1-\alpha)100\%$ CI for $\beta$ is

$$\big\{\,K_N^{*-1}(\alpha/2),\; K_N^{*-1}(1-\alpha/2)\,\big\}.$$

Different from the percentile-t method, quantiles of $b_N^{*(1)},\ldots,b_N^{*(N_B)}$ are used, not quantiles of $T_N^{*(1)},\ldots,T_N^{*(N_B)}$. One disadvantage of this CI is that $b_N$ may fall outside the CI (or near one end of the CI). To avoid this problem, sometimes a ‘bias-corrected CI’ is used, as in the following paragraph.

A two-sided $(1-\alpha)100\%$ bias-corrected CI when the asymptotic distribution is normal is, with $\Phi$ being the $N(0,1)$ df,

$$\Big[\,K_N^{*-1}\big(\Phi\{2\Phi^{-1}(K_N^*(b_N)) + \zeta_{\alpha/2}\}\big),\;\; K_N^{*-1}\big(\Phi\{2\Phi^{-1}(K_N^*(b_N)) + \zeta_{1-\alpha/2}\}\big)\,\Big].$$

If $b_N$ is the median among the pseudo estimates so that $K_N^*(b_N) = 0.5$, then $\Phi^{-1}(K_N^*(b_N)) = 0$: the bias-corrected CI reduces to the preceding $\{K_N^{*-1}(\alpha/2), K_N^{*-1}(1-\alpha/2)\}$. If $b_N$ is smaller than the pseudo-estimate median, then $K_N^*(b_N) < 0.5$ and $\Phi^{-1}(K_N^*(b_N)) < 0$: the bias-corrected CI shifts to the left so that $b_N$ moves toward the center of the CI.
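The bias correction amounts to shifting the two quantile levels by twice the normal score of $K_N^*(b_N)$. A sketch (function name and simulated skewed data are ours; `statistics.NormalDist` supplies $\Phi$ and $\Phi^{-1}$):

```python
import numpy as np
from statistics import NormalDist

def bc_percentile_ci(data, stat, B=999, alpha=0.05, seed=0):
    """Bias-corrected bootstrap percentile CI: shift the quantile levels
    by z0 = Phi^{-1}(K*(b_N)), the normal score of the fraction of
    pseudo estimates below the original estimate b_N."""
    nd = NormalDist()
    rng = np.random.default_rng(seed)
    z = np.asarray(data)
    N = len(z)
    b_N = stat(z)
    pseudo = np.array([stat(z[rng.integers(0, N, N)]) for _ in range(B)])
    z0 = nd.inv_cdf(float(np.clip(np.mean(pseudo < b_N), 1 / B, 1 - 1 / B)))
    lo_p = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))      # shifted alpha/2
    hi_p = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))  # shifted 1-alpha/2
    return np.quantile(pseudo, lo_p), np.quantile(pseudo, hi_p)

rng = np.random.default_rng(6)
sample = rng.lognormal(size=250)    # skewed, so z0 is typically nonzero
lo, hi = bc_percentile_ci(sample, np.mean)
print(lo, hi)
```

With $z_0 = 0$ the code reproduces the plain percentile CI, matching the reduction noted above.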

A natural question at this stage is why bootstrap inference might be preferred to the usual asymptotic inference. First, in terms of convenience, as long as the computing power allows, bootstrap is easier to use, as it just repeats the same estimation procedure $N_B$ times, which makes bootstrap a “no-brain” method. Second, estimating the asymptotic variance may be difficult, which bootstrap avoids. Third, the bootstrap approximation error is equal to or smaller than the asymptotic approximation error; for example, for an asymptotically pivotal $T_N$,

$$G_N(c; F_N) - G_N(c; F) = O_p(N^{-1}), \quad \text{whereas} \quad G_\infty(c) - G_N(c; F) = O(N^{-1/2}).$$

For asymmetric CIs, the smaller-order approximation holds only for the percentile-t method; for symmetric CIs, it holds for both percentile-t and percentile methods. Whenever possible, use percentile-t bootstrap based on a pivotal statistic.

## A.2.4 Nonparametric, Parametric, and Wild Bootstraps

Hypothesis testing can be done with bootstrap CIs (or confidence sets), but sometimes CIs are inappropriate, as in various model goodness-of-fit tests. In such cases, the issue of bootstrap testing appears. The key issue in a bootstrap test is how to impose the null hypothesis in generating pseudo samples. Although we have only mentioned sampling from the original sample with replacement so far (this is ‘nonparametric/empirical bootstrap’), bootstrap testing brings about a host of other ways to generate pseudo samples, depending on how the null hypothesis is imposed.

To appreciate the importance of imposing $H_0$ on pseudo samples, suppose ‘$H_0$: $F$ is $N(0,1)$’. Under the $H_0$, nonparametric bootstrap would yield a pseudo sample consisting of “nearly” $N(0,1)$ random variables, and the test with nonparametric bootstrap would work because the realized TS for the original sample will be similar to the pseudo-sample TSs. Now suppose that the $H_0$ is false because the true model is $N(5,1)$. In this case, we want the realized TS to be much different from the pseudo TSs so that the bootstrap test rejects. If we do not impose the $H_0$ in generating the pseudo samples, then both the original data and the pseudo samples will be similar because they all follow more or less $N(5,1)$, resulting in no rejection. But if we impose ‘$H_0$: $F$ is $N(0,1)$’ on the pseudo samples, then the realized TS for the original sample (centered around $5$) will differ greatly from the TSs from the pseudo samples (centered around $0$), leading to a rejection.

Suppose $H_0: f = f_o(\theta)$; that is, the null model is parametric with an unknown parameter $\theta$. In this case, $\theta$ may be estimated by the MLE $\hat\theta$, and the pseudo data can be generated from $f_o(\hat\theta)$. This is parametric bootstrap, where imposing the $H_0$ on pseudo data is straightforward. For instance, if $H_0: F = \Phi$ in binary response, then (i) $\theta$ in $X'\theta$ can be estimated with the probit $\hat\theta$, (ii) a pseudo observation $X_i^*$ can be drawn from the empirical distribution of $X_1,\ldots,X_N$, and (iii) $Y_i^*$ can be generated from the binary distribution with $P(Y_i^* = 1|X_i^*) = \Phi(X_i^{*\prime}\hat\theta)$.

Often the null model is not fully parametric, in which case parametric bootstrap does not work, and this makes imposing the null on pseudo data far from straightforward. For instance, the null model may be just a linear model $Y_i = X_i'\beta + U_i$ without the distribution of $(X,U)$ specified. In this case, one way of imposing the null goes as follows. Step 1: sample $X_i^*$ from the empirical distribution of $X_1,\ldots,X_N$. Step 2: sample a residual $\hat U_i^*$ from the empirical distribution of the residuals $\hat U_i \equiv Y_i - X_i'b_N$, $i=1,\ldots,N$. Step 3: generate $Y_i^* \equiv X_i^{*\prime}b_N + \hat U_i^*$. Repeat these steps $N$ times to get a pseudo sample of size $N$.
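Steps 1-3 can be sketched as follows (our own illustration, with OLS via `lstsq`; the pseudo LSE on the resampled data should sit near the original estimate):

```python
import numpy as np

def residual_bootstrap_sample(X, Y, rng):
    """One pseudo sample imposing the linear-model null Y = X'b + U:
    draw X* from the empirical distribution of X (Step 1), draw U* from
    the OLS residuals independently (Step 2), set Y* = X*'b_N + U* (Step 3)."""
    N = len(Y)
    b_N, *_ = np.linalg.lstsq(X, Y, rcond=None)
    U_hat = Y - X @ b_N
    X_star = X[rng.integers(0, N, N)]            # Step 1
    U_star = U_hat[rng.integers(0, N, N)]        # Step 2 (independent of X*)
    return X_star, X_star @ b_N + U_star         # Step 3

rng = np.random.default_rng(7)
N = 500
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
Y = X @ np.array([1.0, 2.0]) + rng.standard_normal(N)
X_star, Y_star = residual_bootstrap_sample(X, Y, rng)
b_star, *_ = np.linalg.lstsq(X_star, Y_star, rcond=None)
print(b_star)  # near the original OLS estimate
```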

In the bootstrap scheme for the linear model, $\hat U^*$ is drawn independently of $X^*$, which is fine if $U \amalg X$. But if we want to allow for heteroskedasticity, this bootstrap does not work because $\hat U^*$ is generated independently of $X^*$; instead, wild bootstrap is suitable: with $X_i^* = X_i$, generate $Y_i^* = X_i^{*\prime}b_N + V_i^*\hat U_i$, where $V_i^*$ takes $\pm 1$ with probability $0.5$. Since $E(V^*) = 0$ and $E(V^{*2}) = 1$, we get

$$E^*(V_i^*\hat U_i \mid X_i) = 0 \quad \text{and} \quad E^*\{(V_i^*\hat U_i)^2 \mid X_i\} = \hat U_i^2,$$

preserving the heteroskedasticity in the pseudo sample.
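A sketch of one wild-bootstrap draw (our own heteroskedastic design, where the error scale grows with the regressor): because $|V_i^*\hat U_i| = |\hat U_i|$, each pseudo error keeps its own observation's residual scale, so the pseudo-error magnitudes remain correlated with $x$.

```python
import numpy as np

def wild_bootstrap_sample(X, Y, rng):
    """One wild-bootstrap pseudo sample: keep X_i as is and set
    Y_i* = X_i'b_N + V_i* * U_hat_i with Rademacher V_i* (= +/-1 w.p. 1/2)."""
    b_N, *_ = np.linalg.lstsq(X, Y, rcond=None)
    U_hat = Y - X @ b_N
    V = rng.choice([-1.0, 1.0], size=len(Y))   # Rademacher weights V_i*
    return X @ b_N + V * U_hat, b_N

rng = np.random.default_rng(8)
N = 2000
x = rng.uniform(1, 3, N)
X = np.column_stack([np.ones(N), x])
Y = X @ np.array([0.5, 1.5]) + x * rng.standard_normal(N)  # Var(U|x) grows in x
Y_star, b_N = wild_bootstrap_sample(X, Y, rng)
e_star = Y_star - X @ b_N          # pseudo errors V_i* * U_hat_i
corr = np.corrcoef(x, np.abs(e_star))[0, 1]
print(corr)  # positive: the pseudo errors inherit the heteroskedasticity
```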

# A.3 Confounder Detection, IVE, and Selection Correction

## A.3.1 Coherence Checks

Typically, a causal analysis of the effects of $D$ on $Y$ starts with finding an association between them. Then we control for observed variables $X$ to see whether the association still stands firm with $X$ taken out of the picture; in a sense, we try to “negate” the causal claim using $X$. For unobserved variables $\varepsilon$, we cannot do the same; instead, we try to prop up the prima facie causal finding by showing that the finding is coherent, that is, consistent with auxiliary findings. Sometimes we have an idea of the variables lurking in $\varepsilon$, and sometimes we do not. In either case, there are a number of ways to show coherence.

Suppose that a positive effect has been found initially. One would expect that if the treatment level is increased, say to a double dose, then the effect will become stronger. Likewise, if the treatment is reduced to half the dose, then the effect will become weaker. Furthermore, if the treatment is reversed, a negative effect will occur. Of course, the true relation between $D$ and $Y$ can be highly nonlinear, being negative or positive depending on the level of $D$. Barring such cases, however, confirming those expectations supports the initial causal finding. If the expectations do not hold up, then the initial causal finding is suspect: it might have been due to some $\varepsilon$. Instead of using an extra treatment group with a double/half/reverse dose, using another response variable that is not supposed to be affected by $D$, or another control group supposed to be similar to the original control group, can also help detect the presence of $\varepsilon$. Examples of these appear below.

### Partial Treatment

Suppose we examine job-training effects on reemployment within a certain number of days (e.g., reemployment within 100 days). The T group gets the job training and the C group does not. Suppose there is also a dropout group (a “D” group receiving only part of the required training). The three groups are ordered in terms of treatment dose: the C group with no training, the D group with partial training, and the T group with the full training. If there is no unobserved confounder $\varepsilon$, what is expected (with $X$ controlled) is

$$T \succ D \succ C,$$

where “$C \succ D$” means that the reemployment probability is greater for the C group than for the D group.

Suppose that the observed finding is $D≻C≻T$. There are many possible scenarios for this. One is that the training is harmful, but smart trainees see this and thus drop out; the D group find jobs sooner than the C group because the D group is smarter, resulting in $D≻C≻T$. In this scenario, $ε$ is how smart the person is. Another scenario is that the training is harmful but the D group drops out because they found a job due to a lower reservation wage, resulting in $D≻C≻T$. In this scenario, $ε$ is reservation wage.

If one thinks further, many more scenarios will come up, possibly based on different unobserved confounders. It is not far-fetched to say that in observational studies, negating those scenarios one by one to zoom in on one scenario—and thus presenting a coherent story—is the goal. In short, there is a coherence problem in the job-training example with $D≻C≻T$, which needs to be explained before declaring the treatment good or bad. Had the D group been ignored, only $C≻T$ would have been looked at to conclude a bad job training. The partial treatment group, which is an extra treatment group using an extra dose, casts doubt on the causal finding based on only the full versus no treatment groups.

As another example, Ludwig et al. (2001) examined effects of moving into a lower-poverty area on crime rates. In observational data, people have some control over where they live, and living in a high- or low-poverty area has an element of self-selection, which Ludwig et al. avoided by using experimental data. Since 1994, 638 families from a high-poverty area in Baltimore were randomly assigned to three groups: the T group relocating into an area with a poverty rate under 10%, the D group with no constraints on the poverty rate for relocation, and the C group. The D group is partially treated, because they could (and some did) move into an area with a poverty rate higher than 10%. The outcome variables are juvenile arrest records. A total of 279 teens were arrested 998 times in the pre- and postprogram periods. The crimes were classified into violent crimes (assault, robbery), property crimes, and other crimes (drug offenses, disorderly conduct).

Part of their Table III for juveniles of ages 11–16 is shown below (some covariates are controlled). The second column shows the mean number of arrests per 100 teens per quarter. The third column shows the treatment effect of the T versus C groups, whereas the fourth column shows the treatment effect of the D versus C groups. The two entries with one asterisk are significant at the 10% level, and the entry with two asterisks is significant at 5%.

Effects of Moving into Low Poverty Area on Crimes

| | Mean arrests for C | T versus C (SD) | D versus C (SD) |
|---|---|---|---|
| Violent crimes | 3.0 | $-$1.6 (0.8)$^{**}$ | $-$1.4 (0.8)$^{*}$ |
| Property crimes | 2.0 | 1.3 (0.8) | $-$0.5 (0.8) |
| Other crimes | 3.3 | $-$0.7 (1.0) | $-$1.3 (1.0) |
| All crimes | 8.3 | $-$0.9 (1.8) | $-$3.1 (1.8)$^{*}$ |

The treatment indeed seems to lower the crime rate; note the scope for a property-crime increase, because a high-income area presents more opportunity for property crime. But in terms of the all-crime rate, we have the ranking $C \succ T \succ D$, which is strange, for one would expect $C \succ D \succ T$. One possible scenario is that the areas for the T group may have higher arrest probabilities than the areas for the D group; in other words, crime rates are overestimated in the T-group areas relative to the D-group areas. In this case, arresting intensity/effort is an unobserved confounder.

### Reverse Treatment

If a treatment change from 0 to 1 has a positive effect, then the reverse change from 1 to 0 should have a negative effect, which is another way of being coherent. This can be checked with two similar groups, where one group experiences 0 to 1 and the other 1 to 0; for each group, a before-and-after (BA) design is implemented. Contrast this to difference in differences (DD), where one group experiences 0 to 1 and the other no change; the reverse treatment design is better than DD, because the distinction between the two groups is clearer if the treatment indeed has some effect.

It is also possible to try the reverse treatment design on a single group: the treatment changes from 0 to 1 to see the effect, and is then reversed back to 0 to see the reverse effect. If the treatment is effective, $Y$ will take on level A, then B, and then back to A as the treatment changes from 0 to 1 and back to 0. Comparing this one-group three-point design with the preceding two-group two-point design (here, ‘point’ refers to time points): in the former, we do not have to worry about the difference between the two groups, but we do have to be concerned about the time effect because three time points are involved; in the latter, the opposite holds.

For time-series or panel data, suppose we use BA to find a positive effect of $D$ (water fluoridation) on $Y$ (tooth decay) over five years. At the beginning of the period, fluoridation started (the treatment changing from 0 to 1) and lasted for five years. Comparing the tooth-decay proportions at the beginning and end of the five-year period, the proportion was found to be lower. But during this period, other things may have changed to affect $Y$. For instance, healthy lifestyles might have been adopted (a lower-sugar diet due to enhanced health concerns, including oral health), and this could have been the actual reason for the lower tooth-decay proportion. To refute this possibility, suppose fluoridation stopped (the treatment changing from 1 to 0) and stayed that way for another five years, and suppose the tooth-decay proportion increased during this second five-year period. For a lifestyle story to explain this increase, an unhealthy lifestyle would have to have been adopted during this period, which is unlikely; hence, the reverse treatment corroborates the initial finding. This example is a modified version of actual studies on fluoridation referred to in Gordis (2000, 7–9).

### Multiple Responses

There have been claims of beneficial effects of moderate drinking of alcohol, particularly red wine, on heart disease. Since there are potential risks in drinking, it is difficult to do an experiment, and studies on this causal link are observational, where people self-select their drinking level. Thun et al. (1997) examined a large data set on older U.S. adults with $N=$ 490,000. In 1982, the individuals reported on their drinking habits, and 46,000 died during the nine-year follow-up. In the study, drinking habit was measured separately for beer, wine, and spirits; the sum was then recorded as the total number of drinks per day. It was found that moderate drinking reduces death rates from cardiovascular diseases.

Part of their Table 4 for women is

Deaths (SD) per 100,000 and Number of Drinks per Day

| Cause of death | 0 | Less than 1 | 1 | 2–3 | 4 or more |
| --- | --- | --- | --- | --- | --- |
| Cirrhosis, alcoholism | 5.0 (0.9) | 4.3 (0.9) | 7.7 (1.9) | 10.4 (1.9) | 23.9 (4.5) |
| Cardiovascular diseases | 335 (7.8) | 230 (7.5) | 213 (10.2) | 228 (9.8) | 251 (16) |
| Breast cancer | 30.3 (2.1) | 33.3 (2.4) | 37.6 (4.1) | 45.8 (4.2) | 29.1 (5.3) |
| Injuries & external causes | 22.7 (1.9) | 25.5 (2.2) | 17.7 (2.8) | 18.9 (2.7) | 17.1 (4.0) |

Examining the cirrhosis and alcoholism row, the death rate increases as more alcohol is consumed. The death rate from cardiovascular diseases decreases for moderate drinking but increases as the number of drinks goes up. The death rate from breast cancer increases substantially with drinking but then drops at four drinks or more, which casts some doubt on the study. Most problematic is the death rate for injuries and external causes, which decreases for one drink or more.

If we do a randomized study, then we would expect drinkers to have more accidents (thus a higher death rate for injuries and external causes), because being drunk makes a person less alert and less careful. The opposite finding suggests that drinkers may be systematically different from nondrinkers. Drinkers may be more careful and attentive to their health and lifestyle, and this may be the real reason for the lower cardiovascular disease death rate. Wine drinkers are sometimes reported to have healthy lifestyles in the United States. This may have to do with the fact that wine is more expensive than beer, and better-educated people with more money drink wine. That is, better education could be the common factor driving wine drinking and healthy lifestyles in the United States. Looking at the extra response variable (the death rate due to injuries and external causes), we can see a possible hidden bias due to unobserved confounders such as alertness/carefulness and a healthy lifestyle due to high income and education.

In the drinking example, the extra response variable is expected to be affected by the treatment in a known direction. There are cases where an extra response variable is not supposed to be affected at all. For example, consider the effect of a lower speed limit on the number of traffic accidents. One unobserved confounder is police patrol intensity: it is possible that police patrol is intensified to enforce the lower speed limit, which then reduces the number of traffic accidents, whereas the real effect of the lower speed limit per se is nil. In this example, an extra response variable can be the crime rate, which is not supposed to be affected by the speed limit. If the crime rate does not change following the speed limit change, then we can rule out the possibility of intensified (p.224) patrol efforts affecting traffic accidents. Of course, the best thing would be to find a variable representing police patrol effort and see whether it really changed. When this is not done, the next best thing is to use the extra response variable (crime rate) to detect changes in police patrol intensity.

### Multiple Control Groups

Zero is an intriguing number, and no treatment can mean many different things. With drinking as the treatment, it may mean the real nondrinkers, but it may also mean people who used to drink heavily a long time ago and then stopped for health reasons (ex-drinkers). With job training as the treatment, no treatment can mean people who never applied to the program, but it can also mean people who applied but were rejected. As the real nondrinkers differ from the ex-drinkers, the nonapplicants differ from the rejected. In the job training example, there are two control groups: the nonapplicants and the rejected. Both groups did not receive the treatment, but they can differ in terms of unobserved confounders.

It is possible to detect the presence of unobserved confounders using multiple control groups. Let C denote the nonapplicants and $Cr$ the rejected. Suppose $E(Y|X,C)≠E(Y|X,Cr)$. This must be due to an unobserved variable $ε$, raising the suspicion that the T group might also differ from C and $Cr$ in terms of $ε$. Specifically, to ensure the program’s success, the program administrators may have “cherry-picked” applicants with higher values of $ε$, which can be quality or ability. Then $Cr$ comprises people with low $ε$. In this example, comparing the C group with the extra control group $Cr$ helps one see the presence of an unobserved confounder.

Card and Krueger (1994, 2000) analyzed the effect of a minimum wage increase on employment. In 1992, New Jersey increased its minimum wage from $4.25 to $5.05 per hour. From New Jersey and eastern Pennsylvania, 410 fast food restaurants were sampled before and after the minimum wage change (treatment) for DD. That is, PA fast food restaurants were used as a control group. Not just PA but also NJ fast food restaurants with starting wage higher than $5 were used as another control group, because those NJ restaurants were unlikely to be affected by the treatment. The next table is part of Table 3 in Card and Krueger (1994) and shows the average (SD), where ‘FTE (full-time equivalent)’ is the number of full-time workers plus 0.5 times the number of part-time workers, ‘NJ($4.25)’ is for the NJ restaurants with pretreatment starting wage $4.25 (affected by the treatment), and ‘NJ($5)’ is for the NJ restaurants with pretreatment starting wage $5 or above (little affected by the treatment).

DD with Two Control Groups for Minimum Wage Effect on Employment

|  | NJ | PA | NJ−PA | NJ($4.25) | NJ($5) | NJ($4.25)−NJ($5) |
| --- | --- | --- | --- | --- | --- | --- |
| FTE before | 20.44 | 23.33 | −2.89 | 19.56 | 22.25 | −2.69 |
| FTE after | 21.03 | 21.17 | −0.14 | 20.88 | 20.21 | 0.67 |
| Difference | 0.59 | −2.16 | 2.75 (1.36) | 1.32 | −2.04 | 3.36 (1.48) |

(p.225) From the last row of the left half, despite the minimum wage increase, NJ FTE increased whereas PA FTE decreased; the DD estimate is significantly positive, showing no negative effect of the minimum wage increase. From the right half using NJ($5) as a second control group, almost the same finding is seen. The second control group renders a coherent story, behaving similarly to the first control group. In this DD, no covariates were controlled for; Card and Krueger (1994) tried many regression models, only to conclude no evidence of employment reduction due to the minimum wage increase.

## A.3.2 IVE and Complier Effect

### Two-Stage LSE and IVE

Imagine a health education program where individuals are randomized in ($δ=1$) or out ($δ=0$) to be given some education on health benefits of exercise $D$, and we are interested in the effects of exercise $D$ on health $Y$, not the effects of $δ$ on $Y$. One concern, however, is that there may be unobserved confounders $ε$ affecting both $D$ and $Y$ to make $D$ an endogenous treatment. For instance, $ε$ may be laziness which affects exercise $D$ and health $Y$. In this case, a simple LSE of $Y$ on $(1,D)$ will not work. One solution for this problem is using the ‘exogenous variation’ in $D$ caused by the randomization dummy $δ$. Since $δ$ does not affect $Y$ directly, ‘$δ⟶D⟶Y$‘ holds; $δ$ affects $Y$ only indirectly through $D$. Instrumental variable estimator (IVE) can be applied with $δ$ as an instrumental variable (IV) for $D$.

To formalize the idea, suppose, for mean-zero errors $ε$ and $U$,

$Display mathematics$
where $COR(D,U)≠0$ due to laziness lurking in $ε$ and $U$, but $COR(δ,U)=0$. The interest is in $βd$, the slope of the endogenous treatment $D$. The assumption $αδ≠0$ is critical for $δ$ to give an exogenous variation to $D$; in the foregoing example, the education on health benefits of exercise should make at least some people exercise. Doing the LSE of $D$ on $(1,δ)$ to get $(α^0,α^δ)$ and then doing the LSE of $Y$ on $(1,D^)$ where $D^≡α^0+α^δδ$, we can estimate $βd$ consistently. This is the well-known two-stage LSE, which equals the IVE below.
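To illustrate the two-stage LSE numerically, here is a minimal simulation sketch in Python; the variable names and data-generating values are illustrative assumptions, not from the text. The unobserved confounder makes the simple LSE of $Y$ on $(1,D)$ inconsistent, while the two-stage LSE using the randomization dummy $δ$ recovers $βd$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
delta = rng.integers(0, 2, n)                       # randomization dummy (the IV)
eps = rng.normal(size=n)                            # unobserved confounder ("laziness")
D = 0.5 + 1.0 * delta + eps + rng.normal(size=n)    # treatment; alpha_delta = 1
U = 0.8 * eps + rng.normal(size=n)                  # COR(D,U) != 0 but COR(delta,U) = 0
Y = 2.0 + 1.5 * D + U                               # true beta_d = 1.5

# simple LSE of Y on (1, D): inconsistent because eps sits in both D and U
b_lse = np.linalg.lstsq(np.column_stack([np.ones(n), D]), Y, rcond=None)[0][1]

# stage 1: LSE of D on (1, delta); stage 2: LSE of Y on (1, D-hat)
Z = np.column_stack([np.ones(n), delta])
D_hat = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]
b_2slse = np.linalg.lstsq(np.column_stack([np.ones(n), D_hat]), Y, rcond=None)[0][1]
```

With this design the simple LSE slope is biased upward by roughly $COV(D,U)/V(D)≈0.36$, while the two-stage LSE slope is close to 1.5.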

Rewrite the $D$ and $Y$ structural form (SF) equations as

$Display mathematics$
With $E−1(⋅)$ denoting ${E(⋅)}−1$, the IVE that is consistent for $β$ is
$Display mathematics$
Here, $δ$ is an IV for $D$; we also say that $G$ is an instrument (vector) for $W$. With no covariates $X$, call the IVE ‘simple IVE’.

(p.226) As can be seen in the last equation, an IV $δ$ has to meet two necessary conditions:

$Display mathematics$
the former is the ‘inclusion restriction’ $αδ≠0$—the education should be effective in inducing exercise—and the latter holds due to the randomization of $δ$. An additional requirement is that $δ$ not enter the $Y$ equation directly, which is an ‘exclusion restriction’—$δ$ can influence $Y$ only indirectly through $D$. In short, an IV should meet three conditions: the inclusion restriction, the exclusion restriction, and zero correlation with the model error term.

Substituting the $D$ SF into the $Y$ SF, we get the $Y$ reduced-form (RF) equation:

$Display mathematics$
This shows that if we are interested only in testing ‘$H0:βd=0$’, then we can test whether the slope of $δ$ is zero in the LSE of $Y$ on $(1,δ)$ because $αδ≠0$—this LSE works because $δ$ is exogenous. The $Y$ RF in fact shows more: the slope of $δ$ is
$Display mathematics$
which is the product of the two effects: $αδ$ for $δ$ on $D$, and $βd$ for $D$ on $Y$. This shows yet another way of finding $βd$: do the LSE of $D$ on $(1,δ)$ to get $α^δ$ and $Y$ on $(1,δ)$ to get $γ^δ$, and finally the ratio $γ^δ/α^δ$ for $βd$.

So far, we introduced three ways of finding $βd$: two-stage LSE, simple IVE, and the ratio $γ^δ/α^δ$. Although they may look different, they are numerically the same. The fact that two-stage LSE and simple IVE are the same is well known (see, e.g., Lee 2010a), and the equivalence of the simple IVE slope and the ratio $γ^δ/α^δ$ will be seen shortly under the name ‘Wald estimator’.
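The numerical equivalence can be checked directly. The following sketch (with hypothetical data-generating values) computes the two-stage LSE, the simple IVE slope $COV(Y,δ)/COV(D,δ)$, and the ratio $γ^δ/α^δ$; all three agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
delta = rng.integers(0, 2, n)
eps = rng.normal(size=n)
D = 0.3 + 0.8 * delta + eps
Y = 1.0 + 2.0 * D + 0.5 * eps + rng.normal(size=n)

Z = np.column_stack([np.ones(n), delta])

# (i) two-stage LSE: D on (1, delta), then Y on (1, D-hat)
D_hat = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]
b_2slse = np.linalg.lstsq(np.column_stack([np.ones(n), D_hat]), Y, rcond=None)[0][1]

# (ii) simple IVE slope COV(Y, delta)/COV(D, delta)
b_ive = np.cov(Y, delta)[0, 1] / np.cov(D, delta)[0, 1]

# (iii) ratio of the two reduced-form slopes gamma_delta / alpha_delta
gamma_d = np.linalg.lstsq(Z, Y, rcond=None)[0][1]
alpha_d = np.linalg.lstsq(Z, D, rcond=None)[0][1]
b_ratio = gamma_d / alpha_d
```

The equality is exact in the sample, not just asymptotic: all three reduce algebraically to the ratio of sample covariances.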

Suppose covariates $X$ with $COR(X,U)=0$ appear in the $D$ and $Y$ equations as in

$Display mathematics$
Under $COR(δ,D|X)≠0$ and $COR(δ,U|X)=0$, IVE takes the same form as the simple IVE, but with
$Display mathematics$
When there are more instruments than necessary so that $G$ has more elements than $W$, a generalized version of IVE or ‘generalized method of moments’ (GMM) can be applied. More than enough IVs appear naturally when a conditional moment condition such as $E(U|Z)=0$ is available because functions of $Z$ are candidate IVs.

### Wald Estimator

In the LSE of $Y$ on $(1,D)$, the slope estimator equals the sample mean difference of the two groups $D=0,1$, which was proven in Chapter 1. There is an analogous relation, as follows, between the ‘simple IVE with a binary instrument $δ$ for a binary regressor $D$’ and the ratio of the group mean differences $E(Y|δ=1)−E(Y|δ=0)$ and $E(D|δ=1)−E(D|δ=0)$.

(p.227) Recall the simple IVE whose slope is consistent for

$Display mathematics$
If $D=δ$, this equals the slope parameter $COV(Y,δ)/V(δ)$ of the LSE of $Y$ on $(1,δ)$. Observe
$Display mathematics$
Rewrite the denominator $E(Dδ)−E(D)E(δ)$ in the preceding display as
$Display mathematics$
Analogously, the numerator $E(Yδ)−E(Y)E(δ)$ equals
$Display mathematics$
Canceling $P(δ=0)P(δ=1)$ that appears in both the denominator and numerator gives
$Display mathematics$

The sample version for the last ratio of the group mean differences is the Wald estimator:

$Display mathematics$
In the causal route $δ→D→Y$, the numerator of the Wald estimator is for the multiplicative indirect effect $αδβd$ of $δ$ on $Y$, and the denominator is for the effect $αδ$ of $δ$ on $D$; by the division, the direct effect $βd$ of $D$ on $Y$ is recovered. This is the aforementioned equivalence of simple IVE to the LSE-based ratio $γ^δ/α^δ$.

In a clinical trial where $δ$ is a random assignment, with ‘compliance’ meaning $D=δ$ and ‘noncompliance’ meaning $D≠δ$, $E(Y|δ=1)−E(Y|δ=0)$ is called the ‘intent-to-treat effect’, because it shows the effect of treatment intention (i.e., assignment), not of the actual treatment received. Noncompliance with the treatment dilutes the true effect, and the Wald estimator blows up the diluted effect with the factor ${E(D|δ=1)−E(D|δ=0)}−1$. This is the ‘rescaling’ role of the Wald estimator denominator.
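The rescaling role can be seen in a small simulation sketch (the compliance rate and effect size are made-up numbers): with 60% compliance and a constant effect of 2, the intent-to-treat effect is diluted to about 1.2, and dividing by the first-stage difference restores the effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
delta = rng.integers(0, 2, n)             # random assignment
complier = rng.random(n) < 0.6            # 60% comply; the rest never take the treatment
D = delta * complier                      # treatment actually received
Y = rng.normal(size=n) + 2.0 * D          # constant treatment effect 2

itt = Y[delta == 1].mean() - Y[delta == 0].mean()     # intent-to-treat: about 0.6*2
first = D[delta == 1].mean() - D[delta == 0].mean()   # about 0.6
wald = itt / first                                    # rescaled: about 2
```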

So far, a constant treatment effect, the same for all individuals, has been assumed. If the treatment effect is heterogeneous, varying across individuals, then IVE can be inconsistent. To see this, recall $Y=Y0+(Y1−Y0)D$ and suppose that the individual effect $Y1−Y0$ is not a constant but $Yi1−Yi0=βd+Vi$ with $E(V)=0$. (p.228) Then

$Display mathematics$
where $E(Y0+VD)$ is the intercept and the term in ${⋅}$ is the error. The trouble is $VD$ in the error term, because the instrument $δ$ may be related to $VD$ as $COR(δ,D)≠0$; if $V⨿(D,δ)$, then IVE is consistent because $E(δVD)=E(V)E(δD)=0$. Since $V$ is part of the treatment effect $Y1−Y0$, ‘$V⨿(D,δ)$‘ would be questionable at best. Despite this problem due to heterogeneous effects, IVE is still consistent for an interesting parameter, as is shown next.

### Wald Estimator for Effect on Compliers

Since $δ$ affects $D$, we can imagine potential treatments $(D0,D1)$ depending on $δ=0,1$, analogously to the potential responses $(Y0,Y1)$ depending on $D=0,1$. We only observe $(δ,D,Y)$, although $(δ,D0,D1,Y0,Y1)$ is considered. Classify the individuals with $(D0,D1)$, following Imbens and Angrist (1994) and Angrist et al. (1996):

$Display mathematics$

For the exercise ($D$) example, never-takers never exercise regardless of the education $δ$ on benefits of exercise; compliers exercise only when educated ($δ=1$); defiers exercise only when not educated ($δ=0$); always-takers always exercise regardless of $δ$. We should observe both $D0$ and $D1$ to know the type of a person, but only one of $D0$ and $D1$ is observed; hence, the type is unknown. Since the grouping based on $(D0,D1)$ is not affected by $δ$, this is a ‘principal stratification’ (Frangakis and Rubin 2002). In contrast, the membership for the $D=0$ or $D=1$ group changes as $δ$ changes, so long as $δ$ affects $D$. For instance, the compliers belong to $D=0$ (along with the never-takers) when $δ=0$, but they belong to $D=1$ (along with the always-takers) when $δ=1$.

Suppose

1. (a) $P(D=1|δ)$ changes when $δ$ changes;

2. (b) $(Y0,Y1,D0,D1)⨿δ$;

3. (c) Either $D0≤D1$ or $D0≥D1$ (monotonicity).

Condition (a) is the inclusion restriction that $δ$ is in the $D$ equation to affect $D$. Condition (b) amounts to the exclusion restriction that $δ$ is not in the $Y0$ and $Y1$ equations. Condition (c) is a monotonicity assumption.

One example in which the three conditions hold is

$Display mathematics$
(p.229) Here, $Y1−Y0=β1−β0+V$ with $V≡U1−U0$: the effect varies across individuals. Condition (a) holds due to $αδ≠0$. Condition (b) holds because $δ$ is independent of $(ε,U0,U1)$ and
$Display mathematics$
Condition (c) holds with $≤$ or $≥$ depending on $αδ≷0$. We can allow $αδ$ to vary across individuals (say, $αδi$) without disturbing the above conditions, as long as all $αδi$‘s take the same sign. Without loss of generality, assume $D0≤D1$ to rule out defiers from now on.

Observe

$Display mathematics$

Since $D1−D0=1⟺D1=1,D0=0$ (complier), dividing the first and last expressions by $P(D1−D0=1)$ gives the effect on the compliers

$Display mathematics$
which is the Wald estimator probability limit; the last equality holds because
$Display mathematics$

The effect on compliers is also called the ‘local average treatment effect’ (LATE) (Imbens and Angrist 1994). The qualifier ‘local’ refers to the fact that LATE is specific to the instrument in use. Bear in mind that the LATE interpretation of the simple IVE (i.e., the Wald estimator) requires the above three conditions, and that LATE can change as the instrument in use changes. If IVE changes as the instrument changes, then this is an indication of heterogeneous treatment effects (or some instruments may be invalid). It seems natural that IVE estimates the effect on those who change their behavior as the instrument value changes.
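A simulation sketch (with hypothetical strata shares and effects) makes the point concrete: under heterogeneous effects, the Wald estimator recovers the effect on compliers, not the population average effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
u = rng.random(n)                         # latent type; no defiers (monotonicity)
D0 = (u < 0.2).astype(float)              # always-takers: 20%
D1 = (u < 0.7).astype(float)              # always-takers plus compliers (50%)
delta = rng.integers(0, 2, n)
D = np.where(delta == 1, D1, D0)          # observed treatment

# heterogeneous effects: 1 for always-takers, 3 for compliers, 0.5 for never-takers
effect = np.where(u < 0.2, 1.0, np.where(u < 0.7, 3.0, 0.5))
Y = rng.normal(size=n) + effect * D

wald = (Y[delta == 1].mean() - Y[delta == 0].mean()) / \
       (D[delta == 1].mean() - D[delta == 0].mean())
# wald estimates the complier effect 3, not the average of 1, 3, and 0.5
```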

Abadie (2003) showed that the effect on the compliers can be written also as

$Display mathematics$
(p.230) The proof is similar to that for $E{(D1−D0)(Y1−Y0)}$. For the first term, observe
$Display mathematics$
$E(Y1D0)$, not $E(Y0D0)$, appears because $D0=1$ means treated. Divide the first and last expressions by $E(D|δ=1)−E(D|δ=0)$ to obtain
$Display mathematics$
Analogously, the second term holds due to the following:
$Display mathematics$

## A.3.3 Selection Correction Approach

Other than IVE, there are a number of ways to deal with unobserved confounders: sensitivity analysis, the bounding method, and the selection correction approach. Here we examine only the selection correction approach for binary $D$, eschewing sensitivity analysis and the bounding method, which are not popular in practice. For sensitivity analysis, interested readers can refer to Lee (2004), Altonji et al. (2005), Ichino et al. (2008), Lee and Lee (2009), Rosenbaum (2010), Huber (2014), and references therein. For the bounding method, see Manski (2003), Tamer (2010), Choi and Lee (2012), Nevo and Rosen (2012), Chernozhukov et al. (2013), and references therein.

With $W$ denoting covariates including $X$, suppose

$Display mathematics$
This model includes the exclusion restriction that $C$ is excluded from the $Y$ equation, which is not necessary, but helpful; see, for example, Lee (2010a). From the model, we obtain
$Display mathematics$
$τ$ needs $β1−β0$, and $τ1$ and $τ0$ need $E(U1|D=d)−E(U0|D=d)$ additionally.

(p.231) It holds that

$Display mathematics$
From this,
$Display mathematics$

The parameters can be estimated with ‘Heckman two-stage estimator’ (Heckman 1979) applied separately to the T and C groups. First, $α/σε$ is estimated by probit $α^$, and then

$Display mathematics$
yield estimates, respectively, for
$Display mathematics$
Let $γ^d$ denote the LSE for $γd$, and let the asymptotic variance for $N(γ^d−γd)$ be $Cd$ with $C^d→pCd$. Stack the two estimates and parameters: $γ^≡(γ^0′,γ^1′)′$ and $γ≡(γ0′,γ1′)′$. Then the asymptotic variance of $N(γ^−γ)$ is $C≡diag(C0,C1)$ and $C^≡diag(C^0,C^1)→pC$. The online appendix has a program ‘SelCorcWorkOnVisit’ for the empirical example below, and the program shows how to obtain $C^$ with the first-stage probit error taken into account, drawing on Lee (2010a); if getting $C^$ looks hard, use nonparametric bootstrap.
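A minimal sketch of the two-stage procedure in Python; the data-generating design, variable names, and the scipy-based probit are our illustrative assumptions, not the online-appendix program. A probit for $D$ on $W$ gives $α^$, and group-wise LSEs with the selection correction terms added recover $β1$ and $β0$.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 50_000
x = rng.normal(size=n)
c = rng.normal(size=n)                        # excluded from the Y equations (helpful)
W = np.column_stack([np.ones(n), x, c])       # selection covariates W = (1, X, C)
X = np.column_stack([np.ones(n), x])          # Y-equation regressors

eps = rng.normal(size=n)                      # selection error, N(0,1)
U1 = 0.5 * eps + rng.normal(size=n)           # COR(eps, U1) != 0: selection matters
U0 = -0.3 * eps + rng.normal(size=n)
D = (W @ np.array([0.2, 0.8, 1.0]) + eps > 0).astype(float)
Y = np.where(D == 1, X @ np.array([1.0, 2.0]) + U1,
                     X @ np.array([0.5, 1.0]) + U0)

# first stage: probit MLE for alpha (sigma_eps normalized to 1)
def nll(a):
    z = W @ a
    return -(D * norm.logcdf(z) + (1 - D) * norm.logcdf(-z)).sum()
a_hat = minimize(nll, np.zeros(3), method="BFGS").x

# second stage: group-wise LSEs with the correction terms appended
z = W @ a_hat
lam1 = norm.pdf(z) / norm.cdf(z)              # E(eps | D=1, W)
lam0 = -norm.pdf(z) / norm.cdf(-z)            # E(eps | D=0, W)
t, cg = D == 1, D == 0
b1 = np.linalg.lstsq(np.column_stack([X[t], lam1[t]]), Y[t], rcond=None)[0]
b0 = np.linalg.lstsq(np.column_stack([X[cg], lam0[cg]]), Y[cg], rcond=None)[0]
```

Here `b1[:2]` and `b0[:2]` estimate $β1$ and $β0$, and the coefficients on the correction terms estimate the covariances between the selection error and $U1$, $U0$.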

Suppose $X$ has the dimension $k×1$. Define

$Display mathematics$
An estimator for $τ=E(X′)(β1−β0)$ and its asymptotic variance estimator are, respectively,
$Display mathematics$

(p.232) Let $X¯d$ be the sample mean of $X$ for the subsample $D=d$, and

$Display mathematics$
An estimator for $τd$ and its asymptotic variance estimator are
$Display mathematics$
where $2d−1$ equals $1$ if $d=1$ and $−1$ if $d=0$.

For a simple illustration, recall the example of work ($D$) effect on doctor office visits per year ($Y$) in Chapter 1. Setting $W=X$ (no particular variable to exclude from $X$) and

$Display mathematics$
where $male$ and $married$ are dummies and $phy$ $(psy)$ is self-assessed physical (psychological) condition in five categories (the lower the better). The estimated effects (t-value in $(⋅)$) are
$Display mathematics$
Interestingly, $τ0$ is significantly negative: making the nonworkers work would reduce doctor office visits by 11 per year. There are 31% nonworkers in the data.

# A.4 Supplements for DD Chapter

This section supplements the DD chapter by presenting various nonparametric DD estimators and discussing ‘one-shot’ treatment. These were omitted in the main text to keep the DD chapter within a reasonable limit; for the same reason, we also put the clustered data issues in the TD chapter. Also, a generalization of DD, called ‘change in changes’, is reviewed near the end of this section.

## (p.233) A.4.1 Nonparametric Estimators for Repeated Cross-Section DD

Recall the covariates $W$, the treatment qualification dummy $Q$ and the sampling dummy $S$ for the post-treatment period. Let

$Display mathematics$

Recall the $W$-conditional effects on the treated, untreated, and population:

$Display mathematics$
where $w$ are to be integrated out using $FW|Q=1,S=1=FW3|Q=1$, $FW|Q=0,S=1=FW3|Q=0$, and $FW|S=1=FW3$ for the respective marginal effects. In view of this, consistent estimators for the effect on the treated, untreated, and population are, respectively,
$Display mathematics$
where $#{⋅}$ denotes the number of members in ${⋅}$.
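Ignoring the covariates $W$ (so that no conditioning or integration over $FW3$ is needed), the estimators reduce to the familiar four-group-mean DD; a sketch with a hypothetical repeated cross-section:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40_000
Q = rng.integers(0, 2, n)                 # treatment qualification dummy
S = rng.integers(0, 2, n)                 # sampled in the post-treatment period
# group gap 2, common time effect 0.7, treatment effect 1.5 only for (Q=1, S=1)
Y = 2.0 * Q + 0.7 * S + 1.5 * Q * S + rng.normal(size=n)

m = lambda a, b: Y[(Q == a) & (S == b)].mean()   # cell mean for (Q=a, S=b)
dd = (m(1, 1) - m(1, 0)) - (m(0, 1) - m(0, 0))   # about 1.5
```

With a homogeneous effect, as here, the effects on the treated, untreated, and population coincide.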

## A.4.2 Nonparametric Estimation for DD with Two-Wave Panel Data

With only two periods ($2$ and $3$ for before and after) in hand, nonparametric estimation can be done with $ΔY3=Y3−Y2$ as a single response variable and $W23≡(C′,X2′,X3′)′$ as the covariates; recall $Wit≡(Ci′,Xit′)′$. The resulting estimators are analogous to the matching and nonmatching estimators in Chapters 2 and 3 for cross-section data. In this sense, this section may be taken as a “review” of nonparametric estimators for treatment effects. We present four estimators that appeared in Chapters 2 and 3: matching, weighting, regression imputation (RI), and complete pairing (CP). With only two waves, the time-constancy of $Q$ does not matter, as $Qi3$ can be taken as $Qi$. In this subsection, we write

$Display mathematics$

### (p.234) Matching Estimators

A matching estimator for the effect on the treated under ID $DD$ $ΔY0⊥Q|W$ is

$Display mathematics$
where $N1=∑iQi$, $Ct$ is the comparison group for treated $t$ based on a $W$-distance and $|Ct|$ is the number of the controls in $Ct$. Henceforth, for simplification, we pretend that all $ΔYt$‘s are used although some treated individuals may be passed over in practice if no good matched controls are found (i.e., if $|Ct|=0$). The dimension problem in matching can be avoided with the propensity score $π(W)≡P(Q=1|W)$ in place of $W$. As for the asymptotic inference, nonparametric bootstrap may be used.

A matching estimator for the effect on the untreated under ID $DD′$ $(Y31−Y20)⊥Q|W$ is

$Display mathematics$
where $Tc$ is the comparison group for control $c$. We may define $ΔY1≡Y31−Y20$ to rewrite ID $DD′$ as $ΔY1⊥Q|W$, which would then go parallel with ID $DD$ $ΔY0⊥Q|W$.

Combining the two estimators, an estimator for the effect on the population under $ΔY0⊥Q|W$ and $(Y31−Y20)⊥Q|W$ is

$Display mathematics$
Propensity score matching can be also applied to the effects on the untreated and on the population.
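A sketch of the matching estimator for the effect on the treated, with a scalar covariate and single nearest-neighbor matching ($|Ct|=1$); the design is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4000
W = rng.normal(size=n)                                    # scalar covariate
Q = (rng.random(n) < 1 / (1 + np.exp(-W))).astype(int)    # selection on W
dY = 0.5 * W + rng.normal(size=n) + 2.0 * Q               # Delta-Y; true effect 2

treated = np.flatnonzero(Q == 1)
controls = np.flatnonzero(Q == 0)
# nearest control in W-distance for each treated unit
nn = controls[np.abs(W[treated][:, None] - W[controls][None, :]).argmin(axis=1)]
tau_t = (dY[treated] - dY[nn]).mean()                     # effect on the treated, about 2
```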

### Weighting Estimators

Recall, for cross-section data,

$Display mathematics$
which is the probability limit of the weighting estimator $τ^1w$ in Chapter 3. To do analogously for DD, under ID $DD$, we just have to use
$Display mathematics$
Abadie (2005) used this to propose a number of nonparametric weighting estimators where $π(⋅)$ is nonparametrically estimated. An estimator based on the first term of this (p.235) display for the effect on the treated is
$Display mathematics$

Analogously, the effect on the untreated is

$Display mathematics$
and its sample version is
$Display mathematics$

The effect on the population is the weighted average of $E(Y31−Y30|Q=1)$ and $E(Y31−Y30|Q=0)$, an estimator for which is

$Display mathematics$
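As a sketch of the weighting estimator for the effect on the treated, the following uses the true propensity score $π(W)$ for brevity, whereas in practice $π(⋅)$ would be estimated (nonparametrically as in Abadie 2005, or by probit/logit); the design is made up:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
W = rng.normal(size=n)
pi = 1 / (1 + np.exp(-W))                     # propensity score P(Q=1|W), known here
Q = (rng.random(n) < pi).astype(int)
dY = 0.5 * W + rng.normal(size=n) + 2.0 * Q   # Delta-Y with effect 2 for everyone

# effect on the treated: controls reweighted by pi/(1-pi), normalized by P(Q=1)
tau_t = np.mean(dY * (Q - pi) / (1 - pi)) / Q.mean()
```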

### Regression Imputation Estimators

An RI estimator for the effect on the treated under $ΔY0⊥Q|W$ is

$Display mathematics$
To be specific on $E^(ΔY|Wt,Q=d)$, we can use, for a kernel $K$ and a bandwidth $h$,
$Display mathematics$

Analogous to the RI estimator for the effect on the treated are the following estimators for the effect on the untreated under $(Y31−Y20)⊥Q|W$ and on the population under $ΔY0⊥Q|W$ and $(Y31−Y20)⊥Q|W$:

$Display mathematics$
(p.236)
$Display mathematics$
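A sketch of the RI estimator for the effect on the treated, imputing $E(ΔY|W,Q=0)$ at each treated $W$ by kernel regression with a Gaussian kernel (the bandwidth and design values are ad hoc assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 3000
W = rng.normal(size=n)
Q = (rng.random(n) < 1 / (1 + np.exp(-W))).astype(int)
dY = 0.5 * W + rng.normal(size=n) + 2.0 * Q      # true effect 2

def nw(w_eval, w_obs, y_obs, h):
    """Kernel regression estimate of E(y|w) at each point of w_eval."""
    k = np.exp(-0.5 * ((w_eval[:, None] - w_obs[None, :]) / h) ** 2)
    return (k @ y_obs) / k.sum(axis=1)

t = Q == 1
# impute the untreated change at each treated W, then average the gaps
tau_t = np.mean(dY[t] - nw(W[t], W[~t], dY[~t], h=0.3))
```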

### Complete Pairing Estimator

A CP estimator for DD with continuously distributed $W$ is

$Display mathematics$
where $ω23(w)$ is the product of density of $W|Q=1$ and density of $W|Q=0$ evaluated at $w$.

Suppose $ΔY0⊥Q|W$ and $(Y31−Y20)⊥Q|W$ (or $(ΔY30,Y31−Y20)⨿Q|W23$) hold so that we can drop $Q$ in the integrand of $μ23$, which then becomes

$Display mathematics$
Hence we will be estimating the effect on the population. CP can be done also with $π(W)$ instead of $W$ to avoid the dimension problem; that is, replace $W$ in $DD^23$ with $π^(W)=P^(Q=1|W)$ that is the estimated probit/logit probability. Nonparametric bootstrap or the CP asymptotic variance ignoring the $π(W)$-estimation error can be used for asymptotic inference.

## A.4.3 Panel Linear Model Estimation for DD with One-Shot Treatment

Recall the covariates and treatment-interacting covariates for panel data DD:

$Display mathematics$
where $Ai$ consists of elements of $Ci$ and $Hit$ consists of elements of $Xit$. With one-shot treatment at $t=τ$, we have $Dit=Qi1[t=τ]$ and the model becomes, for $t=0,…,T$,
$Display mathematics$
The treatment effect is $βd+βdg′Gτ$ only at period $t=τ$, and zero at the other periods.

(p.237) Observe

$Display mathematics$
Using these, difference the $Yit$ equation to obtain
$Display mathematics$

To implement DD with this, LSE can be applied under

$Display mathematics$
With no interacting covariates ($βdg=0$), the LSE of $ΔYit$ is only on
$Display mathematics$
the slope of $Qi(1[t=τ]−1[t=τ+1])$ is the desired effect $βd$.

To implement the LSE with $t=0,1,2,3,4,5$ and $τ=3$, observe

$Display mathematics$
the $ΔYi3$, $ΔYi4$, and $ΔYi5$ equations are obtained by setting $t=3,4,5$ in the $ΔYit$ equation:
$Display mathematics$
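The LSE just described can be sketched as follows for $t=0,…,5$ and $τ=3$, with no interacting covariates and period dummies absorbing the time effects; the design values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(9)
n, T, tau = 5000, 5, 3                    # periods t = 0,...,5; one-shot treatment at t = 3
Q = rng.integers(0, 2, n)
ai = rng.normal(size=n)                   # individual effect, removed by differencing
time_fx = np.linspace(0.0, 1.0, T + 1)    # common time effects
beta_d = 1.5
one_shot = (np.arange(T + 1) == tau).astype(float)
Y = (ai[:, None] + time_fx[None, :]
     + beta_d * Q[:, None] * one_shot[None, :]
     + rng.normal(size=(n, T + 1)))

dY = np.diff(Y, axis=1)                   # Delta-Y_it for t = 1,...,5
tgrid = np.arange(1, T + 1)
# regressor Q_i(1[t=tau] - 1[t=tau+1]); its slope is beta_d
d_reg = (Q[:, None] * ((tgrid == tau).astype(float)
                       - (tgrid == tau + 1))).ravel()
period = np.kron(np.ones((n, 1)), np.eye(T))   # period dummies for the time effects
b = np.linalg.lstsq(np.column_stack([period, d_reg]), dY.ravel(), rcond=None)[0]
beta_hat = b[-1]                          # about 1.5
```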

## (p.238) A.4.4 Change in Changes

For repeated cross-sections, Athey and Imbens (2006) proposed a DD without assuming the additivity of the effects. In this section, we review their approach, which is hard to use in practice, not least because of the difficulty in allowing for covariates.

Athey and Imbens assumed that the untreated response $Y0$ is determined by

$Display mathematics$
for an unknown function $h(u,s)$ strictly increasing in $u$; recall that $S=1$ is the dummy for being sampled in the treated period. In this model, $Y0$ is the same across the treatment and control groups because the group dummy $Q$ does not enter $h$ directly, although $Q$ may enter indirectly; for example, $U=(1−Q)U0+QU1$ for group $0$ and $1$ error terms $U0$ and $U1$.

Further assume

$Display mathematics$
The assumption $(i)$ is a weaker version of $S$ being independent of the other random variables, and $(ii)$ is needed to construct the desired counterfactual $Y0|(Q=1,S=1)$ using the identified distributions of the control group’s before and after responses, $Y0|(Q=0,S=0)$ and $Y0|(Q=0,S=1)$. To simplify exposition, assume that $U$ is continuously distributed.

The main challenge in any cross-section DD is constructing the counterfactual $Y0|(Q=1,S=1)$. If we knew $h(⋅,⋅)$ in $Y0=h(U,S)$, then we would be able to find $Y0|(Q=1,S=1)$ by using the inverse function $h−1(⋅;S)$ of $h(U,S)$ for a given $S$. Specifically, first, find $U$ using

$Display mathematics$
Second, with this $U$, the desired counterfactual $Y0|(Q=1,S=1)$ would be
$Display mathematics$
Although $h$ is not observed and thus this scenario does not work, it is enough to know the distribution of $Y0|(Q=1,S=1)$ (not the individual $Yi0|(Qi=1,Si=1)$ for each $i$) to obtain its mean as follows.

With $FY0,jk$ denoting the distribution function of $Y0|(Q=j,S=k)$, the key equation in Athey and Imbens (2006) that gives the desired counterfactual distribution is

$Display mathematics$
$Y$ on the right-hand side equals $Y0$ because the $(10)$, $(00)$, and $(01)$ groups are all untreated. Using this, the desired counterfactual mean $E(Y0|Q=1,S=1)$ is (p.239)
$Display mathematics$

Using the last expression $∫FY,01−1{FY,00(y′)}⋅∂FY,10(y′)$ for $E(Y0|Q=1,S=1)$, Athey and Imbens proposed “change in changes (CC)”:

$Display mathematics$
which is the effect on the treated in the post-treatment era.

A sample version of CC is

$Display mathematics$
where $F^Y,jk$ is the empirical distribution function of $Y|(Q=j,S=k)$. Covariates can be allowed by conditioning the above means and distributions on $X$ (i.e., using only the observations sharing the same value of $X$ in the sample version), but this is hardly practical.
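The sample version can be sketched with simulated data (the function $h$ and the group $U$ distributions below are made up): the empirical transform $F^−1Y,01{F^Y,00(⋅)}$ is applied to the treatment group’s pretreatment responses to build the counterfactual.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 50_000
h = lambda u, s: u + 1.0 * s + 0.3 * u * s     # nonadditive time effect, increasing in u
# control-group U ~ N(0,1); treatment-group U ~ N(0.5,1): the groups may differ in U
y00 = h(rng.normal(0.0, 1.0, n), 0)            # (Q,S) = (0,0)
y01 = h(rng.normal(0.0, 1.0, n), 1)            # (Q,S) = (0,1)
y10 = h(rng.normal(0.5, 1.0, n), 0)            # (Q,S) = (1,0), untreated
y11 = h(rng.normal(0.5, 1.0, n), 1) + 2.0      # (Q,S) = (1,1); effect on treated = 2

# counterfactual draws: empirical F_{Y,01}^{-1}{F_{Y,00}(y)} for y in the (1,0) sample
ranks = np.searchsorted(np.sort(y00), y10) / n
y0_cf = np.quantile(y01, np.clip(ranks, 0.0, 1.0))
cc = y11.mean() - y0_cf.mean()                 # about 2
```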

A simple intuitive way to understand CC is using “quantile time effect.” Recall that the main task in DD is constructing the counterfactual untreated response with only the time effect in the treatment group. More specifically, the task is finding the time effect using only the control group first, and then adding it to the treatment group’s pretreatment response, under the identification condition that the time effect is the same across the two groups.

To see how the time effect can be extracted using the control group, use the “$p$-quantile time effect” for the control group:

$Display mathematics$
Without the time effect, this equation becomes $y−y=0$ because $FY,01=FY,00$. That is, the function
$Display mathematics$
gives the time-effect-augmented untreated response. Hence, putting $Y|(Q=1,S=0)$ into this function gives the time-effect-augmented version $Y0|(Q=1,S=1)$ as was seen in CC:
$Display mathematics$
(p.240)