A.1 Kernel Density and Regression Estimators
A.1.1 Histogram-Type Density Estimator
Suppose ${X}_{1},\dots ,{X}_{N}$ are iid random variables with df $F(x)$ and density function $f(x)$ that is bounded and continuously differentiable. Consider a $B(N,\pi )$ random variable (binomial with $N$ trials and success probability $\pi $), where $h>0$ is a bandwidth:
$$\sum _{i=1}^{N}1[|{X}_{i}-x|<h]\quad \text{with}\quad \pi \equiv P(|X-x|\le h)=P(x-h\le X\le x+h).$$
It holds that
$$E\Big(\sum _{i=1}^{N}1[|{X}_{i}-x|<h]\Big)=N\pi \quad \text{and}\quad V\Big(\sum _{i=1}^{N}1[|{X}_{i}-x|<h]\Big)=N\pi (1-\pi ).$$
A histogram-type density estimator for $f(x)$ with interval size $2h$ is
$$\overline{f}(x)\equiv \frac{1}{N}\sum _{i=1}^{N}\frac{1}{2h}1[|{X}_{i}-x|<h]=\frac{1}{Nh}\sum _{i=1}^{N}\frac{1}{2}1\Big[\Big|\frac{{X}_{i}-x}{h}\Big|<1\Big].$$
Without
$2h$ in the denominator,
$\overline{f}(x)$ would be a histogram showing the proportion of observations falling in
$x\pm h$. Since
$\overline{f}(x)$ estimates
$f(x)$ without parameterizing
$f(x)$ (e.g., as normal or logistic),
$\overline{f}(x)$ is a ‘nonparametric estimator’ for
$f(x)$. For instance, if we know
$X\sim N(\mu ,{\sigma}^{2})$ with unknown parameters
$\mu $ and
${\sigma}^{2}$, then
$f(x)$
can be estimated parametrically with
$$\frac{1}{{s}_{N}\sqrt{2\pi}}\mathrm{exp}\Big\{-\frac{1}{2}\Big(\frac{x-\overline{X}}{{s}_{N}}\Big)^{2}\Big\},\quad \text{where}\quad \overline{X}\equiv \frac{1}{N}\sum _{i}{X}_{i}\quad \text{and}\quad {s}_{N}^{2}\equiv \frac{1}{N}\sum _{i}({X}_{i}-\overline{X})^{2}.$$
In the following, we show
$\overline{f}(x){\to}^{p}f(x)$; see, for example,
Lee (2010a) for more on nonparametric density and regression estimation. Be mindful of the distinction between the data
${X}_{1},\dots ,{X}_{N}$ and
the evaluation point $x$ that does not have to equal any of ${X}_{1},\dots ,{X}_{N}$.
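The histogram-type estimator translates directly into code. The following is a minimal sketch; the $N(0,1)$ sample, the evaluation point $x=0$, and the bandwidth are illustrative choices, not from the text:

```python
import numpy as np

def hist_density(x, X, h):
    """Histogram-type estimate f-bar(x): the proportion of the sample
    within h of x, divided by the interval length 2h."""
    return np.mean(np.abs(X - x) < h) / (2.0 * h)

rng = np.random.default_rng(0)
X = rng.standard_normal(10_000)        # iid N(0,1) sample (illustrative)
fbar = hist_density(0.0, X, h=0.2)     # true f(0) = 1/sqrt(2*pi) ~ 0.399
```

With this bandwidth the estimate lands close to the true density value at $x=0$, illustrating the consistency argument that follows.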
Using the mean and variance for $\sum _{i}1[|{X}_{i}-x|<h]$, we get
$$E\{\overline{f}(x)\}=\frac{N\pi}{N\cdot 2h}=\frac{\pi}{2h}\quad \text{and}\quad V\{\overline{f}(x)\}=\frac{\pi (1-\pi )}{4N{h}^{2}}=\frac{\pi}{2h}\cdot \frac{1-\pi}{2Nh}.$$
With
$\pi =F(x+h)-F(x-h)$, Taylor expansion yields, for some
${x}^{\ast}$ and
$O({h}^{2})$ essentially denoting a constant times
${h}^{2}$,
$$\begin{array}{rcl}\pi &=&\{F(x)+f(x)h+\frac{{f}^{\mathrm{\prime}}({x}^{\ast}){h}^{2}}{2}\}-\{F(x)-f(x)h+\frac{{f}^{\mathrm{\prime}}({x}^{\ast}){h}^{2}}{2}\}=2f(x)h+O({h}^{2})\\ &\Longrightarrow & \frac{\pi}{2h}=f(x)+O(h).\end{array}$$
Substitute $\pi /2h=f(x)+O(h)$ into the above $E\{\overline{f}(x)\}$ and $V\{\overline{f}(x)\}$ to obtain
$$E\{\overline{f}(x)\}=f(x)+O(h)\quad \text{and}\quad V\{\overline{f}(x)\}=\{f(x)+O(h)\}\frac{1-\pi}{2Nh}.$$
Let
$h\to 0$ and
$Nh\to \mathrm{\infty}$ as
$N\to \mathrm{\infty}$. Then, as
$N\to \mathrm{\infty}$,
$$E\{\overline{f}(x)\}\to f(x)\quad \text{and}\quad V\{\overline{f}(x)\}\to 0;$$
$h\to 0$ makes the bias
$E\{\overline{f}(x)\}-f(x)$ disappear, whereas
$Nh\to \mathrm{\infty}$ makes the variance disappear. This implies
$\overline{f}(x){\to}^{p}f(x)$, because
$\overline{f}(x){\to}^{p}E\{\overline{f}(x)\}$ due to an LLN and
$E\{\overline{f}(x)\}\to f(x)$.
A.1.2 Kernel Density Estimator
The role of the indicator function $1[|{X}_{i}-x|/h<1]$ in $\overline{f}(x)$ is weighting the $i$th observation: the weight is $1$ if ${X}_{i}$ falls within the $h$-distance from $x$, and $0$ otherwise. Generalizing this weighting idea, we can think of a smooth weight depending on ${X}_{i}-x$. Let $X$ now be a $k\times 1$ vector; in this case, the $h$-interval becomes a region of volume proportional to ${h}^{k}$; for example, the two-dimensional analog of the interval $(x-h,x+h)$ is the square around $x$ of size $(2h{)}^{2}=4{h}^{2}$.
A ‘kernel density estimator’ is based on the smooth weighting idea:
$$\hat{f}(x)\equiv \frac{1}{N{h}^{k}}\sum _{i=1}^{N}K\left(\frac{{X}_{i}-x}{h}\right),$$
where
$K$ (called a ‘
kernel’) is a smooth multivariate density that is symmetric about
$0$ (e.g., the
$N(0,{I}_{k})$ density). The kernel estimator
$\hat{f}(x)$ includes
$\overline{f}(x)$ as a special case when the ‘uniform kernel’
$K(z)=1[|z|<1]/2$ is used with
$k=1$.
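A minimal sketch of $\hat{f}(x)$ for $k=1$ with the $N(0,1)$ kernel; the sample, the evaluation point, and the bandwidth constant are illustrative assumptions:

```python
import numpy as np

def kde(x, X, h):
    """Kernel density estimate f-hat(x) = (Nh)^{-1} sum_i K((X_i - x)/h), k = 1."""
    z = (X - x) / h
    K = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)   # N(0,1) kernel phi(z)
    return K.sum() / (len(X) * h)

rng = np.random.default_rng(1)
X = rng.standard_normal(5_000)
h = X.std() * len(X) ** (-1 / 5)   # rule-of-thumb bandwidth with nu = 1
fhat = kde(0.0, X, h)              # true f(0) ~ 0.399 for N(0,1) data
```

Replacing the smooth kernel with `np.abs(z) < 1` times $1/2$ recovers $\overline{f}(x)$ as the uniform-kernel special case.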
Analogously to the earlier proof for $\overline{f}(x){\to}^{p}f(x)$, we can show that $\hat{f}(x){\to}^{p}f(x)$. Furthermore, under some regularity conditions,
$$(N{h}^{k})^{1/2}\{\hat{f}(x)-f(x)\}\rightsquigarrow N\Big\{0,\;f(x)\int K(z)^{2}\mathrm{\partial}z\Big\},$$
which can be used to construct (pointwise) confidence intervals (CI) for
$f(x)$. Although
$\int K(z)\mathrm{\partial}z=1$,
$\int K(z{)}^{2}\mathrm{\partial}z\ne 1$ in general: we have to find
$\int K(z{)}^{2}\mathrm{\partial}z$ for CIs. One simple way to find
$\int K(z{)}^{2}\mathrm{\partial}z$ is using a ‘Monte Carlo integral’: generating
${Z}_{1},\dots ,{Z}_{n}$ iid
$N(0,1)$, it holds that
$$\frac{1}{n}\sum _{i=1}^{n}\frac{K({Z}_{i})^{2}}{\varphi ({Z}_{i})}{\to}^{p}\int \frac{K(z)^{2}}{\varphi (z)}\varphi (z)\mathrm{\partial}z=\int K(z)^{2}\mathrm{\partial}z\quad \text{as}\;n\to \mathrm{\infty};$$
the pseudo-sample average (i.e., the first term) can be used for
$\int K(z{)}^{2}\mathrm{\partial}z$.
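The Monte Carlo integral is easy to code. Here it is checked on the quartic (biweight) kernel listed below, for which $\int K(z)^{2}\mathrm{\partial}z=5/7$ can be computed analytically; the sample size and seed are arbitrary illustrative choices:

```python
import numpy as np

def quartic(z):
    """Quartic (biweight) kernel (15/16)(1 - z^2)^2 1[|z| < 1]."""
    return (15.0 / 16.0) * (1.0 - z**2) ** 2 * (np.abs(z) < 1)

def phi(z):
    """N(0,1) density, used as the importance density."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(2)
Z = rng.standard_normal(200_000)           # iid N(0,1) pseudo sample
mc = np.mean(quartic(Z) ** 2 / phi(Z))     # Monte Carlo estimate of int K^2
# analytically, int K(z)^2 dz = (225/256) * (256/315) = 5/7 for this kernel
```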
Other than the uniform kernel and the $N(0,1)$ kernel $\varphi (\cdot )={(2\pi )}^{-1/2}\mathrm{exp}\{-{(\cdot )}^{2}/2\}$, popular kernels are
$$\begin{array}{rcl}& & \frac{3}{4}(1-{z}^{2})\cdot 1[|z|<1]:\text{(trimmed) quadratic kernel,}\\ & & \frac{15}{16}(1-{z}^{2})^{2}\cdot 1[|z|<1]:\text{quartic or biweight kernel.}\end{array}$$
For
$k>1$, ‘product kernels’ consisting of copies of univariate kernels can be used; for example,
$K(z)=\prod _{j=1}^{k}\varphi ({z}_{j})$.
The scalar $h$ is a ‘bandwidth’ or ‘smoothing parameter’, whose role is analogous to that of the interval size in a histogram. If $h$ is too small, there is no grouping (averaging) and $\hat{f}(x)$ will be too jagged as $x$ varies (a small bias but a large variance). If $h$ is too large, $\hat{f}(x)$ will show little variation (a large bias but a small variance). As with the histogram interval size, there is no single best rule for choosing $h$ in practice. When $k=1,2$, the best strategy is visual inspection: choose $h$ such that the graph $x\longmapsto \hat{f}(x)$ is neither too jagged nor too smooth; if anything, slightly undersmooth.
A practical rule of thumb for choosing $h$ is $h\simeq \nu \cdot {N}^{-1/(k+4)}$ with, say, $0.5\le \nu \le 3$ if $k$ is $1$ or $2$, with all components of $X$ standardized. For example, if $K(z)=\prod _{j=1}^{k}\varphi ({z}_{j})$ with $z=({z}_{1},\dots ,{z}_{k})^{\mathrm{\prime}}$ is used, then
$$\begin{array}{rcl}K\left(\frac{{X}_{i}-x}{h}\right)& =& \prod _{j=1}^{k}\varphi \left[\frac{{X}_{ji}-{x}_{j}}{SD({X}_{j})\,\nu {N}^{-1/(k+4)}}\right]\\ & =& \prod _{j=1}^{k}\frac{1}{\sqrt{2\pi}}\mathrm{exp}\Big\{-\frac{1}{2}\Big(\frac{{X}_{ji}-{x}_{j}}{SD({X}_{j})\,\nu {N}^{-1/(k+4)}}\Big)^{2}\Big\}.\end{array}$$
More discussion on choosing
$K$ and
$h$ appears below.
A.1.3 Kernel Regression Estimator
A kernel regression estimator $\hat{\rho}(x)$ for $\rho (x)\equiv E(Y|X=x)$ in
$${Y}_{i}=\rho ({X}_{i})+{U}_{i}\quad \text{with}\quad E(U|X)=0\;\Longleftrightarrow\; E(Y|X)=\rho (X)$$
is
$$\hat{\rho}(x)\equiv \frac{(N{h}^{k})^{-1}\sum _{i=1}^{N}K(({X}_{i}-x)/h)\,{Y}_{i}}{(N{h}^{k})^{-1}\sum _{i=1}^{N}K(({X}_{i}-x)/h)}=\frac{\hat{g}(x)}{\hat{f}(x)},$$
where the numerator of
$\hat{\rho}(x)$ is defined as
$\hat{g}(x)$. Rewrite
$\hat{\rho}(x)$ as
$$\sum _{i=1}^{N}\left\{\frac{K(({X}_{i}-x)/h)}{\sum _{j=1}^{N}K(({X}_{j}-x)/h)}\right\}\cdot {Y}_{i}$$
to see that
$\hat{\rho}(x)$ is a weighted average of
${Y}_{i}$'s, where the weight is large if
${X}_{i}$ is close to
$x$ and small otherwise.
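The weighted-average form of $\hat{\rho}(x)$ can be sketched directly; the toy model $\rho (x)=\sin x$, the noise level, and the bandwidth are illustrative assumptions (the kernel's normalizing constant cancels between numerator and denominator):

```python
import numpy as np

def nw(x, X, Y, h):
    """Kernel regression estimate rho-hat(x) = g-hat(x)/f-hat(x):
    a kernel-weighted average of the Y_i's."""
    K = np.exp(-0.5 * ((X - x) / h) ** 2)   # N(0,1) kernel; constants cancel
    return np.sum(K * Y) / np.sum(K)

rng = np.random.default_rng(3)
X = rng.standard_normal(5_000)
Y = np.sin(X) + 0.3 * rng.standard_normal(5_000)   # toy model rho(x) = sin(x)
h = 0.5 * X.std() * len(X) ** (-1 / 5)
rho1 = nw(1.0, X, Y, h)    # true rho(1) = sin(1) ~ 0.84
```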
Similarly to $\hat{f}(x){\to}^{p}f(x)$, it can be shown that
$$\hat{g}(x){\to}^{p}E(Y|X=x)\cdot f(x),$$
which implies, when combined with
$\hat{f}(x){\to}^{p}f(x)$,
$$\hat{\rho}(x){\to}^{p}\rho (x).$$
Analogously to the asymptotic normality of
$(N{h}^{k}{)}^{1/2}\{\hat{f}(x)f(x)\}$, it holds under some regularity conditions that
$$(N{h}^{k})^{0.5}\{\hat{\rho}(x)-\rho (x)\}\rightsquigarrow N\Big\{0,\;\frac{V(U|x)\int K(z)^{2}\mathrm{\partial}z}{f(x)}\Big\}.$$
$V(U|x)=E({U}^{2}|x)$ can be estimated using the residual
${\hat{U}}_{i}\equiv {Y}_{i}-\hat{\rho}({X}_{i})$:
$$\hat{V}(U|x)\equiv \frac{(N{h}^{k})^{-1}\sum _{i=1}^{N}K(({X}_{i}-x)/h){\hat{U}}_{i}^{2}}{(N{h}^{k})^{-1}\sum _{i=1}^{N}K(({X}_{i}-x)/h)}.$$
To implement kernel estimation, one has to choose $K$ and $h$. As for $K$, it is known that the choice of kernel makes little difference. But the story is quite different for $h$, because the choice of $h$ makes a huge difference. When $k=1$ or $2$, other than the above rule of thumb, a good practical method is drawing $\hat{f}(x)$ or $\hat{\rho}(x)$ over a reasonable range for $x$ and choosing $h$ such that the curve estimate is neither too smooth (as when $h$ is too big) nor too jagged (as when $h$ is too small), as in choosing $h$ for density estimation. If $k>2$, this balancing act is hard to do. In this case, it is advisable to choose $h$ by the following scheme.
Consider minimizing for $h$:
$$\sum _{i}\{{Y}_{i}-{\hat{\rho}}_{i}({X}_{i})\}^{2}\quad \text{where}\quad {\hat{\rho}}_{i}({X}_{i})\equiv \frac{\sum _{j=1,j\ne i}^{N}K(({X}_{j}-{X}_{i})/h){Y}_{j}}{\sum _{j=1,j\ne i}^{N}K(({X}_{j}-{X}_{i})/h)};$$
${\hat{\rho}}_{i}({X}_{i})$ is a ‘leave-one-out estimator’ for
$\rho ({X}_{i})$. This method of choosing
$h$ is called ‘cross-validation (CV)’, which works well in practice. For estimating
$f(x)$,
${Y}_{i}$ is irrelevant, and a CV minimand for
$h$ is
$$\frac{1}{{N}^{2}{h}^{k}}\sum _{i}\sum _{j}{K}^{(2)}\left(\frac{{X}_{i}-{X}_{j}}{h}\right)-\frac{2}{N(N-1){h}^{k}}\sum _{i}\sum _{j\ne i}K\left(\frac{{X}_{i}-{X}_{j}}{h}\right),$$
where
${K}^{(2)}(a)\equiv \int K(a-z)K(z)\mathrm{\partial}z$.
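The regression CV minimand can be coded with one kernel matrix whose diagonal is zeroed out to implement the leave-one-out fit; the toy data and the bandwidth grid are illustrative assumptions:

```python
import numpy as np

def loo_cv_score(h, X, Y):
    """Sum of squared leave-one-out errors sum_i {Y_i - rho-hat_i(X_i)}^2."""
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2)  # K[i,j] = K((X_i-X_j)/h)
    np.fill_diagonal(K, 0.0)               # drop j = i: leave one out
    rho_loo = (K @ Y) / K.sum(axis=1)      # rho-hat_i(X_i)
    return np.sum((Y - rho_loo) ** 2)

rng = np.random.default_rng(4)
X = rng.standard_normal(300)
Y = np.sin(X) + 0.3 * rng.standard_normal(300)
grid = [0.1, 0.2, 0.4, 0.8, 1.6]
h_cv = min(grid, key=lambda h: loo_cv_score(h, X, Y))   # cross-validated h
```

In practice one would minimize over a finer grid or with a numerical optimizer; the coarse grid here only illustrates the mechanics.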
A.1.4 Local Linear Regression
The kernel estimator $\hat{\rho}(x)$ can be obtained by minimizing the following with respect to (wrt) $a$:
$$\sum _{i}({Y}_{i}-a)^{2}\cdot K\left(\frac{{X}_{i}-x}{h}\right);$$
$\hat{\rho}(x)$ may be viewed as predicting
${Y}_{i}$ locally around
$x$ using only the intercept
$a$. A variation of
$\hat{\rho}(x)$ is obtained using a line (an intercept and a slope) centered at
$x$, which is
local linear regression (LLR) minimizing
$$\sum _{i}\{{Y}_{i}-a-{b}^{\mathrm{\prime}}({X}_{i}-x)\}^{2}\cdot K\left(\frac{{X}_{i}-x}{h}\right)$$
wrt
$a$ and
$b$. The intercept estimator
$\hat{a}(x)$ for
$a$ is the LLR estimator for
$\rho (x)$, whereas the slope estimator
$\hat{b}(x)$ for
$b$ is the LLR estimator for the derivative
$\mathrm{\partial}\rho (x)/\mathrm{\partial}x$. See
Fan (1996) for LLR in general.
To be specific, the LLR estimator for $\rho (x)$ is (${0}_{1\times k}$ is the $1\times k$ null vector)
$$\hat{a}(x)=(1,{0}_{1\times k})\cdot \{X(x)^{\mathrm{\prime}}W(x)X(x)\}^{-1}\cdot \{X(x)^{\mathrm{\prime}}W(x)Y\},$$
where
$Y\equiv ({Y}_{1},\dots ,{Y}_{N}{)}^{\mathrm{\prime}}$,
$W(x)\equiv diag\{K(({X}_{1}-x)/h),\dots ,K(({X}_{N}-x)/h)\}$, and
$$\underset{N\times (1+k)}{X(x)}\equiv \left[\begin{array}{c}1,\;({X}_{1}-x)^{\mathrm{\prime}}\\ \vdots \\ 1,\;({X}_{N}-x)^{\mathrm{\prime}}\end{array}\right].$$
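The matrix formula for $\hat{a}(x)$ (and the slope $\hat{b}(x)$) translates line by line into code; a minimal sketch for scalar $X$ with a toy quadratic design, where the noise level and bandwidth are illustrative choices:

```python
import numpy as np

def llr(x, X, Y, h):
    """Local linear regression at scalar x: returns (a-hat(x), b-hat(x))."""
    w = np.exp(-0.5 * ((X - x) / h) ** 2)             # N(0,1) kernel weights
    Xx = np.column_stack([np.ones_like(X), X - x])    # design matrix X(x)
    W = np.diag(w)                                    # weight matrix W(x)
    coef = np.linalg.solve(Xx.T @ W @ Xx, Xx.T @ W @ Y)
    return coef[0], coef[1]   # a-hat ~ rho(x), b-hat ~ d rho(x)/dx

rng = np.random.default_rng(5)
X = rng.standard_normal(2_000)
Y = X - X**2 + 0.3 * rng.standard_normal(2_000)   # toy design: rho(x) = x - x^2
h = 0.5 * X.std() * len(X) ** (-1 / 5)
a, b = llr(0.5, X, Y, h)   # rho(0.5) = 0.25, rho'(0.5) = 0
```

For large $N$, replacing the explicit `np.diag(w)` by row-scaling (`Xx.T * w`) avoids building the $N\times N$ weight matrix.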
Compared with the LLR estimator $\hat{a}(x)$, the usual kernel estimator $\hat{\rho}(x)$ may be called the ‘local constant regression (LCR)’ estimator. Relatively speaking, LLR is less biased but more variable than LCR; this is the classic trade-off between bias and variance. The advantage of LLR in being less biased tends to be visible around boundary points of the support of $X$ and around peaks and troughs of $E(Y|X=x)$.
In Figure A.1 with $N=200$, we generated $Y$ with $Y=X-{X}^{2}+U$ where $U,X\sim N(0,1)$ with $U\amalg X$, and LCR and LLR estimates were obtained with $K=\varphi $ (the $N(0,1)$ kernel). In the left panel, we set $h=0.5\times SD(X){N}^{-1/5}$, and $h=2.5\times SD(X){N}^{-1/5}$ in the right panel is five times greater than the $h$ in the left panel. The solid lines are the true regression function $E(Y|X)=X-{X}^{2}$, whereas the dashed and dotted lines are the LCR and LLR estimates, respectively. In the left panel, LCR and LLR are almost the same, and both are undersmoothed in view of the wiggly parts. In the right panel, both are oversmoothed with the larger bandwidth, but LCR is more biased than LLR as LCR clearly oversmooths the peak.
A.2 Bootstrap
This section reviews bootstrap, drawing on Lee (2010a) who in turn drew on Hall (1992), Efron and Tibshirani (1993), Shao and Tu (1995),
Davison and Hinkley (1997), Horowitz (2001), and Efron (2003). See also Van der Vaart (1998), Lehmann and Romano (2005), and DasGupta (2008). In the main text, we mentioned ‘nonparametric (or empirical) bootstrap’ many times to simplify asymptotic inference. Hence, before embarking on the review of bootstrap in general, we quickly explain nonparametric bootstrap in the following.
Given an original sample of size $N$ and an estimate ${\hat{\alpha}}_{N}$ for a parameter $\alpha $, (i) resample from the original sample with replacement to construct a pseudo sample of size $N$; (ii) apply the same estimation procedure to the pseudo sample to get a pseudo estimate ${\hat{\alpha}}_{N}^{b}$; (iii) repeat this $B$ times (e.g., $B=500$; the higher the better) to obtain ${\hat{\alpha}}_{N}^{b}$, $b=1,\dots ,B$; (iv) use quantiles of each component of ${\hat{\alpha}}_{N}^{b}$, $b=1,\dots ,B$, to construct a confidence interval (CI) for the corresponding component of $\alpha $; for example, the 0.025 and 0.975 quantiles of the second components of $({\hat{\alpha}}_{N}^{1},\dots ,{\hat{\alpha}}_{N}^{B})$ give a 95% CI for the second component of $\alpha $.
Instead of CIs, sometimes the variance estimator ${B}^{-1}\sum _{b=1}^{B}({\hat{\alpha}}_{N}^{b}-{\hat{\alpha}}_{N})({\hat{\alpha}}_{N}^{b}-{\hat{\alpha}}_{N})^{\mathrm{\prime}}$ is used as an estimator of the asymptotic variance of ${\hat{\alpha}}_{N}$. Although CIs from the bootstrap are consistent as long as the estimation procedure is “smooth,” the consistency of the variance estimator is not known in general.
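Steps (i)-(iv) can be sketched for the sample mean; the exponential population, the seed, and the number of repetitions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
Z = rng.exponential(scale=2.0, size=200)   # original sample; alpha = E(Z) = 2
alpha_hat = Z.mean()                       # original estimate

B = 999                                    # bootstrap repetitions
boot = np.empty(B)
for b in range(B):
    Zb = rng.choice(Z, size=len(Z), replace=True)   # (i) resample with replacement
    boot[b] = Zb.mean()                             # (ii) same estimator on pseudo sample

ci = np.quantile(boot, [0.025, 0.975])     # (iv) 95% percentile CI for alpha
```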
In the online appendix, the program ‘BootAvgSim’ illustrates how to do nonparametric bootstrap (as well as the ‘bootstrap percentile-t method’ to be explained below) for the mean. The program ‘RegImpPsNprSim’ shows how to implement nonparametric bootstrap in the regression imputation approach, which can be easily modified for the bootstrap of other approaches.
A.2.1 Review on Usual Asymptotic Inference
Statistical inference is conducted with CI and hypothesis test (HT). For a $k\times 1$ parameter $\beta $ and an estimator ${b}_{N}{\to}^{p}\beta $, CI and HT are done using the asymptotic distribution of a transformation of ${b}_{N}$: in most cases, for some variance $V$,
$$\sqrt{N}({b}_{N}-\beta )\rightsquigarrow N(0,V)\;\Longrightarrow\; \sqrt{N}{V}^{-1/2}({b}_{N}-\beta )\rightsquigarrow N(0,{I}_{k}).$$
The test statistic (TS)
$\sqrt{N}{V}^{-1/2}({b}_{N}-\beta )$ is
asymptotically pivotal because its asymptotic distribution is a
known distribution as in
$N(0,{I}_{k})$.
To do inference with CI, note $\sqrt{N}({t}^{\mathrm{\prime}}{b}_{N}-{t}^{\mathrm{\prime}}\beta )\rightsquigarrow N(0,{t}^{\mathrm{\prime}}Vt)$ for a known $k\times 1$ vector $t$. With ${\zeta}_{\alpha}$ denoting the $\alpha $ quantile of $N(0,1)$ and ${V}_{N}{\to}^{p}V$, as $N\to \mathrm{\infty}$,
$$\begin{array}{rcl}& & P\Big\{-{\zeta}_{1-\alpha /2}<\frac{\sqrt{N}({t}^{\mathrm{\prime}}{b}_{N}-{t}^{\mathrm{\prime}}\beta )}{\sqrt{{t}^{\mathrm{\prime}}{V}_{N}t}}<{\zeta}_{1-\alpha /2}\Big\}\\ & & \quad \to P\{-{\zeta}_{1-\alpha /2}<N(0,1)<{\zeta}_{1-\alpha /2}\}=1-\alpha \\ & & \quad \Longrightarrow\; P\Big\{{t}^{\mathrm{\prime}}{b}_{N}-{\zeta}_{1-\alpha /2}\frac{\sqrt{{t}^{\mathrm{\prime}}{V}_{N}t}}{\sqrt{N}}<{t}^{\mathrm{\prime}}\beta <{t}^{\mathrm{\prime}}{b}_{N}+{\zeta}_{1-\alpha /2}\frac{\sqrt{{t}^{\mathrm{\prime}}{V}_{N}t}}{\sqrt{N}}\Big\}\to 1-\alpha .\end{array}$$
This gives a CI for
${t}^{\mathrm{\prime}}\beta $; for example,
$t=(0,\dots ,0,1{)}^{\mathrm{\prime}}$ and
$\alpha =0.05$ yields a symmetric asymptotic 95% CI for
${\beta}_{k}$. For
${H}_{0}:{t}^{\mathrm{\prime}}\beta =c$ for a specified value of
$c$ (typically
$c=0$), we reject the
${H}_{0}$ if
$c$ is not captured by the CI. The false rejection probability (i.e., the type I error) is
$\alpha $.
Alternatively, instead of using a CI, we can use an asymptotically pivotal TS to conduct an HT: if the realized value of the TS is “extreme” relative to the known asymptotic distribution under ${H}_{0}$, then the ${H}_{0}$ is rejected. For instance, under ${H}_{0}:{t}^{\mathrm{\prime}}\beta =c$, we can use
$$\frac{\sqrt{N}({t}^{\mathrm{\prime}}{b}_{N}-c)}{\sqrt{{t}^{\mathrm{\prime}}{V}_{N}t}}\rightsquigarrow N(0,1),\quad \text{where the unknown}\;{t}^{\mathrm{\prime}}\beta \;\text{is replaced by}\;c\;\text{in}\;{H}_{0}.$$
For two-sided tests, we choose the critical region
$(-\mathrm{\infty},-{\zeta}_{1-\alpha /2})$ and
$({\zeta}_{1-\alpha /2},\mathrm{\infty})$, and reject
${H}_{0}$ if the realized value of the TS falls in the critical region (with the false rejection probability
$\alpha $). A better way might be looking at the
p-value $$2\times P\Big\{N(0,1)>\Big|\text{realized value of}\;\frac{\sqrt{N}({t}^{\mathrm{\prime}}{b}_{N}-c)}{\sqrt{{t}^{\mathrm{\prime}}{V}_{N}t}}\Big|\Big\}$$
to reject the
${H}_{0}$ if the pvalue is smaller than
$\alpha $. For one-sided tests, this HT scenario requires minor modifications.
Although CI and HT are equivalent to (i.e., “dual” to) each other in the case of using $\sqrt{N}({b}_{N}-\beta )\rightsquigarrow N(0,V)$, there are many HTs whose corresponding CIs are hard to think of. For instance, ${H}_{0}$: the distribution of $Y$ is symmetric about $0$, or ${H}_{0}:E({Y}^{4})=3\{E({Y}^{2})\}^{2}$.
A.2.2 Bootstrap to Find Quantiles
Define the exact distribution function (df) for a statistic ${T}_{N}(F)$:
$${G}_{N}(c;F)\equiv P\{{T}_{N}(F)\le c\},\quad \text{where}\quad {T}_{N}(F)\equiv {V}_{N}(F)^{-1/2}\sqrt{N}\{{b}_{N}(F)-\beta (F)\},$$
where
$F$ is the distribution for the original sample and
${V}_{N}$ is a ‘scaling number (matrix)’. Regard
$\beta $ as a scalar for simplification. Keep in mind the distinction between a (probability) distribution and its df; a df is just a deterministic function.
We desire ${G}_{N}(c;F)$: how ${T}_{N}(F)$ behaves with a given sample of size $N$ when the sample is drawn from the true distribution $F$. The last display makes it explicit that the exact, not asymptotic, distribution of ${T}_{N}(F)$ depends on the underlying distribution $F$. The usual large sample inference in the preceding section uses the approximation (the ‘asymptotic df’ of ${T}_{N}(F)$) for ${G}_{N}(c,F)$:
$${G}_{\mathrm{\infty}}(c;F)\equiv \underset{N\to \mathrm{\infty}}{lim}{G}_{N}(c,F).$$
Often
${T}_{N}(F)$ is
asymptotically pivotal:
${G}_{\mathrm{\infty}}(c;F)$ does not depend on
$F$; for example,
${G}_{\mathrm{\infty}}(c,F)=P\{N(0,{I}_{k})\le c\}$. We may then write just
${G}_{\mathrm{\infty}}(c)$ instead of
${G}_{\mathrm{\infty}}(c;F)$. In this case, the large sample approximation
${G}_{\mathrm{\infty}}(c;F)$ to
${G}_{N}(c;F)$ is done only through one route (“through the subscript”). “Tworoute” approximation is shown next.
Suppose ${T}_{N}(F)$ is not asymptotically pivotal; for example, ${G}_{\mathrm{\infty}}(c,F)=\mathrm{\Phi}\{c/\sigma (F)\}$ where the parameter of interest is the mean and $\sigma (F)$ is the SD. In this nonpivotal case, the nuisance parameter $\sigma (F)$ should be replaced by an estimator, say, ${s}_{N}\equiv \sigma ({F}_{N})$. In a case like this with an asymptotically nonpivotal ${T}_{N}(F)$, ${G}_{\mathrm{\infty}}(c,{F}_{N})$ is used as a large sample approximation for ${G}_{N}(c;F)$ due to the estimated nuisance parameter: two routes of approximation are done between ${G}_{N}(c;F)$ and ${G}_{\mathrm{\infty}}(c,{F}_{N})$, through the subscript $\mathrm{\infty}$ and ${F}_{N}$.
Suppose that ${G}_{N}(c,F)$ is smooth in $F$ in the sense
$${G}_{N}(c;{F}_{N})-{G}_{N}(c;F){\to}^{p}0\quad \text{as}\;N\to \mathrm{\infty},\quad \text{where}\;{F}_{N}\;\text{is the empirical distribution for}\;F;$$
recall that the empirical distribution
${F}_{N}$ gives probability
${N}^{-1}$ to each observation
${Z}_{i}$,
$i=1,\dots ,N$.
Bootstrap uses ${G}_{N}(c;{F}_{N})$ as an approximation to ${G}_{N}(c;F)$ where the approximation is done only through
${F}_{N}$. This is in contrast to the large sample approximation
${G}_{\mathrm{\infty}}(c)$ or
${G}_{\mathrm{\infty}}(c,{F}_{N})$ to
${G}_{N}(c,F)$.
Whether the last display holds depends on the smoothness of ${G}_{N}(c;F)$ as a functional of $F$. This also shows that consistent estimators for $F$ other than ${F}_{N}$ (e.g., a smoothed version of ${F}_{N}$) may be used in place of ${F}_{N}$. This is the basic bootstrap idea: replace $F$ with ${F}_{N}$ and do the same thing as with $F$. Since the smoothness of ${G}_{N}(c;F)$ is the key ingredient for bootstrap, if the “source” ${T}_{N}(F)$ is not smooth in $F$, bootstrap either will not work as well (e.g., quantile regression is one degree less smooth than LSE, and bootstrap works for quantile regression in a weaker sense than for LSE) or will not work at all. Bear in mind the different versions of $G$ that have appeared so far:

                 Nonoperational                  Operational
Finite Sample    ${G}_{N}(c;F)$ (the target)     ${G}_{N}(c;{F}_{N})$ (bootstrap)
Asymptotic       ${G}_{\mathrm{\infty}}(c;F)$    ${G}_{\mathrm{\infty}}(c)$ (pivotal); ${G}_{\mathrm{\infty}}(c;{F}_{N})$ (nonpivotal)
Using ${G}_{N}(c;{F}_{N})$ means treating the original sample $({Z}_{1},\dots ,{Z}_{N})$ as the population—that is, the “population distribution” is multinomial with $P(Z={Z}_{i})={N}^{-1}$. Specifically, with $F$ replaced by ${F}_{N}$, we have
$${G}_{N}(c;{F}_{N})=P\{{T}_{N}({F}_{N})\le c\}=P[{V}_{N}({F}_{N})^{-1/2}\sqrt{N}\{{b}_{N}({F}_{N})-\beta ({F}_{N})\}\le c]$$
and
$\beta ({F}_{N})$ is the parameter for the empirical distribution. For instance, suppose
$\beta (F)=E(Z)=\int z\mathrm{\partial}F(z)$ and the estimator for
$\beta $ is the sample mean
${b}_{N}=\overline{Z}$. Considering a pseudo sample
${Z}_{1}^{\ast},\dots ,{Z}_{N}^{\ast}$ drawn from
${F}_{N}$ with replacement—some observations in the original sample get drawn multiple times while some never get drawn—we have
$$\begin{array}{rcl}\beta ({F}_{N})&=&\int z\mathrm{\partial}{F}_{N}(z)=\frac{1}{N}\sum _{i}{Z}_{i}=\overline{Z},\\ &&\quad \text{as}\;{F}_{N}\;\text{assigns weight}\;\frac{1}{N}\;\text{to each support point}\;{Z}_{i},\\ {b}_{N}({F}_{N})&=&{\overline{Z}}^{\ast}\equiv \frac{1}{N}\sum _{i}{Z}_{i}^{\ast},\\ &&\quad \text{the pseudo sample mean estimating the parameter}\;\beta ({F}_{N})=\overline{Z},\\ V({F}_{N})&=&\frac{1}{N}\sum _{i}{Z}_{i}^{2}-{\overline{Z}}^{2}=\frac{1}{N}\sum _{i}({Z}_{i}-\overline{Z})^{2},\\ &&\quad \text{which is also the sample variance ‘}{V}_{N}(F)\text{’,}\\ {V}_{N}({F}_{N})&=&\frac{1}{N}\sum _{i}{Z}_{i}^{\ast 2}-{\overline{Z}}^{\ast 2}=\frac{1}{N}\sum _{i}({Z}_{i}^{\ast}-{\overline{Z}}^{\ast})^{2},\\ &&\quad \text{the pseudo sample variance to estimate}\;V({F}_{N}).\end{array}$$
This example illustrates that bootstrap approximates the distribution of (scaled)
$\overline{Z}-E(Z)$ with that of (scaled)
${\overline{Z}}^{\ast}-\overline{Z}$. That is, the
relationship of ${b}_{N}=\overline{Z}$ to $\beta =E(Z)$ is inferred from that of ${b}_{N}^{\ast}={\overline{Z}}^{\ast}$ to ${b}_{N}=\overline{Z}$.
${G}_{N}(c;{F}_{N})$ may look hard to get, but it can be estimated as precisely as desired because ${F}_{N}$ is known. One pseudo sample of size $N$ gives one realization of ${T}_{N}({F}_{N})$. Repeating this ${N}_{B}$ times yields ${N}_{B}$-many pseudo realizations, ${b}_{N}^{\ast (1)},\dots ,{b}_{N}^{\ast ({N}_{B})}$. Due to an LLN applied with the “population distribution ${F}_{N}$ for the pseudo sample”, we get
$$\frac{1}{{N}_{B}}\sum _{j=1}^{{N}_{B}}1[\{{V}_{N}^{\ast (j)}\}^{-1/2}\sqrt{N}({b}_{N}^{\ast (j)}-{b}_{N})\le c]\to {G}_{N}(c;{F}_{N})\quad \text{as}\;{N}_{B}\to \mathrm{\infty}.$$
This convergence is ‘in probability’ or ‘a.e.’ conditional on the original sample
${Z}_{1},\dots ,{Z}_{N}$. Hence there are two phases of approximation in bootstrap: the first is with
${N}_{B}\to \mathrm{\infty}$ for a given
$N$ (as in this display), and the second is with
$N\to \mathrm{\infty}$ for
${G}_{N}(c;{F}_{N})-{G}_{N}(c;F){\to}^{p}0$. Since we can increase
${N}_{B}$ as much as we want, we can ignore the first phase of approximation to consider the second phase only. This is the bootstrap consistency that we take as a fact here: quantiles found from the pseudo estimates are consistent for the population quantiles.
A.2.3 Percentile-t and Percentile Methods
Suppose ${T}_{N}={V}_{N}^{-1/2}\sqrt{N}({b}_{N}-\beta )$ is asymptotically pivotal. Using bootstrap quantiles ${\xi}_{N,\alpha /2}$ and ${\xi}_{N,1-\alpha /2}$ of ${T}_{N}^{\ast (1)},\dots ,{T}_{N}^{\ast ({N}_{B})}$, we can construct a $(1-\alpha )100\mathrm{\%}$ bootstrap CI for $\beta $:
$$\begin{array}{rcl}& & {\xi}_{N,\alpha /2}<{V}_{N}^{-1/2}\sqrt{N}({b}_{N}-\beta )<{\xi}_{N,1-\alpha /2}\\ & & \quad \Longrightarrow\; \Big({b}_{N}-{\xi}_{N,1-\alpha /2}\frac{{V}_{N}^{1/2}}{\sqrt{N}},\;{b}_{N}-{\xi}_{N,\alpha /2}\frac{{V}_{N}^{1/2}}{\sqrt{N}}\Big)\;\text{for}\;\beta .\end{array}$$
This way of constructing a CI with an asymptotically pivotal
${T}_{N}$ is called
percentile-t method—‘percentile’ because percentiles (i.e., quantiles) are used and ‘t’ because
${T}_{N}$ takes the form of the usual
$t$-value that is asymptotically pivotal.
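A sketch of the percentile-t CI for the mean, where each pseudo $t$-value is centered at ${b}_{N}$ rather than at $\beta $; the exponential data and the number of repetitions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
Z = rng.exponential(scale=2.0, size=200)   # beta = E(Z) = 2 (illustrative)
N = len(Z)
b_N = Z.mean()
V_N = Z.var()                              # scaling V_N: sample variance

B = 999
T_star = np.empty(B)
for b in range(B):
    Zb = rng.choice(Z, size=N, replace=True)
    # pseudo t-value: centered at b_N, the "parameter" of F_N
    T_star[b] = np.sqrt(N) * (Zb.mean() - b_N) / np.sqrt(Zb.var())

lo, hi = np.quantile(T_star, [0.025, 0.975])     # xi_{N,.025} and xi_{N,.975}
ci = (b_N - hi * np.sqrt(V_N / N), b_N - lo * np.sqrt(V_N / N))
```

Note the quantile reversal: the upper bootstrap quantile sets the lower CI endpoint, matching the display above.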
There is also the percentile method using ${b}_{N}$. Define the exact df for ${b}_{N}$ as
$${J}_{N}(c;F)\equiv P\{{b}_{N}(F)\le c\}.$$
The bootstrap estimator for
${J}_{N}(c,F)$ is
${N}_{B}^{-1}\sum _{j=1}^{{N}_{B}}1[{b}_{N}^{\ast (j)}\le c]$. Denoting the empirical df of
${b}_{N}^{\ast (1)},\dots ,{b}_{N}^{\ast ({N}_{B})}$ as
${K}_{N}^{\ast}$, a
$(1-\alpha )100\mathrm{\%}$ CI for
$\beta $ is
$$\Big\{{K}_{N}^{\ast -1}\Big(\frac{\alpha}{2}\Big),\;{K}_{N}^{\ast -1}\Big(1-\frac{\alpha}{2}\Big)\Big\}.$$
The difference from the percentile-t method is that quantiles of
${b}_{N}^{\ast (1)},\dots ,{b}_{N}^{\ast ({N}_{B})}$ are used, not quantiles of
${T}_{N}^{\ast (1)},\dots ,{T}_{N}^{\ast ({N}_{B})}$. One disadvantage with this CI is that
${b}_{N}$ may fall outside the CI (or near one end of the CI). To avoid this problem, sometimes a ‘biascorrected CI’ gets used as in the following paragraph.
A two-sided $(1-\alpha )100\mathrm{\%}$ bias-corrected CI when the asymptotic distribution is normal is, with $\mathrm{\Phi}$ being the $N(0,1)$ df,
$$\Big({K}_{N}^{\ast -1}\big[\mathrm{\Phi}\{{\zeta}_{\alpha /2}+2{\mathrm{\Phi}}^{-1}({K}_{N}^{\ast}({b}_{N}))\}\big],\;\;{K}_{N}^{\ast -1}\big[\mathrm{\Phi}\{{\zeta}_{1-\alpha /2}+2{\mathrm{\Phi}}^{-1}({K}_{N}^{\ast}({b}_{N}))\}\big]\Big).$$
If
${b}_{N}$ is the median among the pseudo estimates so that
${K}_{N}^{\ast}({b}_{N})=0.5$, then
${\mathrm{\Phi}}^{-1}({K}_{N}^{\ast}({b}_{N}))=0$: the bias-corrected CI reduces to the preceding
$\{{K}_{N}^{\ast -1}(\alpha /2),\;{K}_{N}^{\ast -1}(1-\alpha /2)\}$. If
${b}_{N}$ is smaller than the pseudo estimate median, then
${K}_{N}^{\ast}({b}_{N})<0.5$, and
${\mathrm{\Phi}}^{-1}({K}_{N}^{\ast}({b}_{N}))<0$:
the bias-corrected CI shifts to the left so that
${b}_{N}$ moves to the center of the CI.
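The bias-corrected formula can be sketched with Python's `statistics.NormalDist` supplying $\mathrm{\Phi}$ and ${\mathrm{\Phi}}^{-1}$; the exponential sample and the mean as estimator are illustrative assumptions:

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()   # Phi = nd.cdf, Phi^{-1} = nd.inv_cdf

def bias_corrected_ci(boot, b_N, alpha=0.05):
    """Bias-corrected percentile CI from the pseudo estimates boot."""
    z0 = nd.inv_cdf(np.mean(boot <= b_N))          # Phi^{-1}(K*_N(b_N))
    p_lo = nd.cdf(nd.inv_cdf(alpha / 2) + 2 * z0)  # Phi{zeta_{alpha/2} + 2 z0}
    p_hi = nd.cdf(nd.inv_cdf(1 - alpha / 2) + 2 * z0)
    return np.quantile(boot, [p_lo, p_hi])         # K*_N^{-1} at adjusted levels

rng = np.random.default_rng(8)
Z = rng.exponential(scale=2.0, size=200)           # illustrative sample
b_N = Z.mean()
boot = np.array([rng.choice(Z, size=len(Z), replace=True).mean()
                 for _ in range(999)])
ci = bias_corrected_ci(boot, b_N)
```

When ${b}_{N}$ sits at the bootstrap median, `z0` is zero and the adjusted levels collapse to $\alpha /2$ and $1-\alpha /2$, recovering the plain percentile CI.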
A natural question at this stage is why bootstrap inference might be preferred to the usual asymptotic inference. First, in terms of convenience, as long as the computing power allows, bootstrap is easier to use as it just repeats the same estimation procedure ${N}_{B}$ times, which makes bootstrap a “no-brainer” method. Second, estimating the asymptotic
variance may be difficult, which bootstrap avoids. Third, the bootstrap approximation error is equal to or smaller than the asymptotic approximation error; for example,
$${G}_{\mathrm{\infty}}(c;{F}_{N})-{G}_{N}(c;F)={O}_{p}({N}^{-1/2}),\quad \text{whereas}\quad {G}_{N}(c;{F}_{N})-{G}_{N}(c;F)={O}_{p}({N}^{-1}).$$
For asymmetric CIs, the smaller-order approximation holds only for the percentile-t method; for symmetric CIs, it holds for both percentile-t and percentile methods. Whenever possible, use percentile-t bootstrap based on a pivotal statistic.
A.2.4 Nonparametric, Parametric, and Wild Bootstraps
Hypothesis testing can be done with bootstrap CIs (or confidence sets), but sometimes CIs are inappropriate—for example, in various model goodness-of-fit tests. In such cases, the issue of bootstrap test appears. The key issue in bootstrap test is how to impose the null hypothesis in generating pseudo samples. Although we have only mentioned sampling from the original sample with replacement so far—this is ‘nonparametric/empirical bootstrap’—bootstrap test brings about a host of other ways to generate pseudo samples, depending on how the null hypothesis is imposed.
To appreciate the importance of imposing ${H}_{0}$ on pseudo samples, suppose ‘${H}_{0}$: $F$ is $N(0,1)$’. Under this ${H}_{0}$, nonparametric bootstrap would yield a pseudo sample consisting of “nearly” $N(0,1)$ random variables, and the test with nonparametric bootstrap would work because the realized TS for the original sample will be similar to the pseudo-sample TSs. Now suppose that the ${H}_{0}$ is false because the true model is $N(5,1)$. In this case, we want the realized TS to be much different from the pseudo TSs so that the bootstrap test rejects. If we do not impose the ${H}_{0}$ in generating the pseudo samples, then both the original data and the pseudo samples will be similar because they all follow more or less $N(5,1)$, resulting in no rejection. But if we impose ‘${H}_{0}$: $F$ is $N(0,1)$’ on the pseudo samples, then the realized TS for the original sample (centered around $5$) will differ greatly from the TSs from the pseudo samples (centered around $0$), leading to a rejection.
Suppose ${H}_{0}:f={f}_{o}(\cdot ;\theta )$; that is, the null model is parametric with an unknown parameter $\theta $. In this case, $\theta $ may be estimated by the MLE $\hat{\theta}$, and the pseudo data can be generated from ${f}_{o}(\cdot ;\hat{\theta})$. This is parametric bootstrap, where imposing the ${H}_{0}$ on pseudo data is straightforward. For instance, if ${H}_{0}:F=\mathrm{\Phi}$ in binary response, then (i) $\theta $ in ${X}^{\mathrm{\prime}}\theta $ can be estimated with the probit $\hat{\theta}$, (ii) a pseudo observation ${X}_{i}^{\ast}$ can be drawn from the empirical distribution of ${X}_{1},\dots ,{X}_{N}$, and (iii) ${Y}_{i}^{\ast}$ can be generated from the binary distribution with $P({Y}_{i}^{\ast}=1|{X}_{i}^{\ast})=\mathrm{\Phi}({X}_{i}^{\ast \mathrm{\prime}}\hat{\theta})$.
Often the null model is not fully parametric, in which case parametric bootstrap does not work, and this makes imposing the null on pseudo data far from straightforward. For instance, the null model may be just a linear model ${Y}_{i}={X}_{i}^{\mathrm{\prime}}\beta +{U}_{i}$ without the distribution of $(X,U)$ specified. In this case, one way of imposing the null goes as follows. Step 1: sample ${X}_{i}^{\ast}$ from the empirical distribution of ${X}_{1},\dots ,{X}_{N}$. Step 2: sample a residual ${\hat{U}}_{i}^{\ast}$ from the empirical distribution of the residuals ${\hat{U}}_{i}\equiv {Y}_{i}-{X}_{i}^{\mathrm{\prime}}{b}_{N}$, $i=1,\dots ,N$. Step 3: generate ${Y}_{i}^{\ast}\equiv {X}_{i}^{\ast \mathrm{\prime}}{b}_{N}+{\hat{U}}_{i}^{\ast}$. Repeat this $N$ times to get a pseudo sample of size $N$.
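Steps 1-3 for the linear null model can be sketched as follows; the data-generating values, sample size, and number of repetitions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)
N = 300
X = np.column_stack([np.ones(N), rng.standard_normal(N)])  # intercept + regressor
Y = X @ np.array([1.0, 2.0]) + rng.standard_normal(N)      # illustrative DGP

b_N = np.linalg.lstsq(X, Y, rcond=None)[0]   # LSE on the original sample
U_hat = Y - X @ b_N                          # residuals

B = 500
boot = np.empty((B, 2))
for b in range(B):
    X_star = X[rng.integers(0, N, N)]        # Step 1: X* from the empirical dist.
    U_star = U_hat[rng.integers(0, N, N)]    # Step 2: residuals, drawn independently
    Y_star = X_star @ b_N + U_star           # Step 3: impose the linear null
    boot[b] = np.linalg.lstsq(X_star, Y_star, rcond=None)[0]
```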
In the bootstrap scheme for the linear model, ${\hat{U}}^{\ast}$ is drawn independently of $X$, which is fine if $U\amalg X$. But if we want to allow for heteroskedasticity, this bootstrap does not work because ${\hat{U}}^{\ast}$ is generated independently of $X$; instead, wild bootstrap is suitable: with ${X}_{i}^{\ast}={X}_{i}$, generate ${Y}_{i}^{\ast}={X}_{i}^{\ast \mathrm{\prime}}{b}_{N}+{V}_{i}^{\ast}{\hat{U}}_{i}$, where ${V}_{i}^{\ast}$ takes $\pm 1$ with probability $0.5$ each. Since $E({V}^{\ast})=0$ and $E({V}^{\ast 2})=1$, we get
$$\begin{array}{rcl}& & E({V}_{i}^{\ast}{\hat{U}}_{i}|{X}_{i})=E({V}_{i}^{\ast}|{X}_{i})E({\hat{U}}_{i}|{X}_{i})=0\\ & & \phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}E({V}_{i}^{\ast 2}{\hat{U}}_{i}^{2}|{X}_{i})=E({V}_{i}^{\ast 2}|{X}_{i})E({\hat{U}}_{i}^{2}|{X}_{i})\simeq E({U}_{i}^{2}|{X}_{i}),\end{array}$$
preserving the heteroskedasticity in the pseudo sample.
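A one-draw sketch of the wild bootstrap in the same illustrative setup (function name ours; ${V}_{i}^{\ast}$ are Rademacher weights as in the text):

```python
import numpy as np

def wild_bootstrap_sample(X, Y, rng):
    """Wild bootstrap: X_i* = X_i and Y_i* = X_i'b_N + V_i* * Uhat_i,
    with V_i* = +1 or -1, each with probability 0.5."""
    b_N = np.linalg.lstsq(X, Y, rcond=None)[0]
    U_hat = Y - X @ b_N
    V = rng.choice([-1.0, 1.0], size=len(Y))     # E V* = 0, E V*^2 = 1
    return X, X @ b_N + V * U_hat
```

Because $|{V}_{i}^{\ast}|=1$, each pseudo error has the same magnitude as the residual it came from, which is how the heteroskedasticity pattern in $X$ is preserved.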
A.3 Confounder Detection, IVE, and Selection Correction
A.3.1 Coherence Checks
Typically, a causal analysis for effects of $D$ on $Y$ starts with finding an association between them. Then we control for observed variables $X$ to see whether the association still stands firm with $X$ taken out of the picture; in a sense, we try to “negate” the causal claim using $X$. For unobserved variables $\epsilon $, we cannot do the same; instead, we try to prop up the prima facie causal finding by showing that the finding is coherent, that is, consistent with auxiliary findings. Sometimes we have an idea of the variables lurking in $\epsilon $, and sometimes we do not. In either case, there are a number of ways to show coherence.
Suppose that a positive effect has been found initially. One would expect that if the treatment level is increased, say to the double dose, then the effect will become stronger. Likewise, if the treatment is reduced to half the dose, then the effect will become weaker. Furthermore, if the treatment is reversed, a negative effect will occur. Of course, the true relation between $D$ and $Y$ can be highly nonlinear, being negative or positive depending on the level of $D$. Barring such cases, however, confirming those expectations supports the initial causal finding. If the expectations do not hold up, then the initial causal finding is suspect: it might have been due to some $\epsilon $. Instead of using an extra treatment group with double/half/reverse treatment, using another response not supposed to be affected by $D$ or another control group supposed to be similar to the original control group also can help detect the presence of $\epsilon $. Examples for these appear below.
Partial Treatment
Suppose we examine job-training effects on reemployment within a certain number of days (e.g., reemployment within 100 days). The T group get the job training and the C group do not. Suppose there is also a dropout group (a “D” group receiving only part of the required training). The three groups are ordered in terms of treatment dose: the C group with no training, the D group with partial training, and the T group with the full training. If there is no unobserved confounder $\epsilon $, what is expected (with $X$ controlled) is
$$C\succ D\succ T\text{}\text{(bad treatment)}\phantom{\rule{thickmathspace}{0ex}}\text{}\text{}\text{}\text{or}\phantom{\rule{thickmathspace}{0ex}}\text{}\text{}\phantom{\rule{thickmathspace}{0ex}}\phantom{\rule{thickmathspace}{0ex}}T\succ D\succ C\text{}\text{(good treatment)},$$
where “$C\succ D$” means that the reemployment probability is greater for the C group than for the D group.
Suppose that the observed finding is $D\succ C\succ T$. There are many possible scenarios for this. One is that the training is harmful, but smart trainees see this and thus drop out; the D group find jobs sooner than the C group because the D group is smarter, resulting in $D\succ C\succ T$. In this scenario, $\epsilon $ is how smart the person is. Another scenario is that the training is harmful but the D group drops out because they found a job due to a lower reservation wage, resulting in $D\succ C\succ T$. In this scenario, $\epsilon $ is reservation wage.
If one thinks further, many more scenarios will come up, possibly based on different unobserved confounders. It is not far-fetched to say that in observational studies, negating those scenarios one by one to zoom in on one scenario—and thus presenting a coherent story—is the goal. In short, there is a coherence problem in the job-training example with $D\succ C\succ T$, which needs to be explained before declaring the treatment good or bad. Had the D group been ignored, only $C\succ T$ would have been looked at, leading to the conclusion of a bad job training. The partial treatment group, which is an extra treatment group using an extra dose, casts doubt on the causal finding based on only the full versus no treatment groups.
As another example, Ludwig et al. (2001) examined effects of moving into a lower-poverty area on crime rates. In observational data, people have some control over where they live, and living in a high/low-poverty area has an element of self-selection, which Ludwig et al. avoided by using experimental data. Since 1994, 638 families from a high-poverty area in Baltimore were randomly assigned to three groups: the T group relocating into an area with poverty rate under 10%, the D group without constraints on poverty rate for relocation, and the C group. The D group is partially treated, because they could (and some did) move into an area with poverty rate higher than 10%. The outcome variables are juvenile arrest records. A total of 279 teens were arrested 998 times in the pre- and post-program periods. The crimes were classified into violent crimes (assault, robbery), property crimes, and other crimes (drug offenses, disorderly conduct).
Part of their Table III for juveniles of ages 11–16 is shown below (some covariates are controlled). The second column shows the mean number of arrests per 100 teens per quarter. The third column shows the treatment effect of the T versus C groups, whereas the fourth column shows the treatment effect of the D versus C groups. The two entries with one asterisk are significant at the 10% level, and the entry with two asterisks is significant at the 5% level.
Effects of Moving into Low Poverty Area on Crimes

                    Mean arrests for C   T versus C (SD)   D versus C (SD)
Violent crimes      3.0                  −1.6 (0.8)**      −1.4 (0.8)*
Property crimes     2.0                  1.3 (0.8)         −0.5 (0.8)
Other crimes        3.3                  −0.7 (1.0)        −1.3 (1.0)
All crimes          8.3                  −0.9 (1.8)        −3.1 (1.8)*
The treatment indeed seems to lower the crime rate; note, though, the scope for a property crime increase, because a high-income area presents more opportunity for property crime. But in terms of all crime rates, we have the ranking $C\succ T\succ D$, which is strange, for one would expect $C\succ D\succ T$. One possible scenario is that the areas for the T group may have higher arrest probabilities than the areas for the D group; in other words, crime rates are overestimated in the T group areas relative to the D group areas. In this case, arresting intensity/effort is an unobserved confounder.
Reverse Treatment
If a treatment change from 0 to 1 has a positive effect, then the reverse change from 1 to 0 should have a negative effect, which is another way of being coherent. This can be checked with two similar groups where one group experiences 0 to 1 and the other group 1 to 0; for each group, a before-and-after (BA) design is implemented. Contrast this to difference in differences (DD), where one group experiences 0 to 1 and the other group no change; the reverse treatment design is better than DD, because the distinction between the two groups is clearer if indeed the treatment has some effect.
It is also possible to try the reverse treatment design on a single group: the treatment changes from 0 to 1 to see the effect, and is then reversed back to 0 to see the reverse effect. If the treatment is effective, $Y$ will take on levels A, B, and back to A as the treatment changes from 0 to 1 and back to 0. Comparing this one-group three-point design with the preceding two-group two-point design (here, ‘point’ refers to time points), in the former, we do not have to worry about the difference between the two groups but we do have to be concerned about the time effect because three time points are involved. In the latter, the opposite holds.
For time-series or panel data, suppose we use BA to find a positive effect of $D$ (water fluoridation) on $Y$ (tooth decay) over five years. In the beginning of the period, fluoridation started (treatment changing from 0 to 1) and lasted for five years. Comparing the tooth decay proportions at the beginning and end of the five-year period, the proportion has been found to be lowered. But during this period, other things may have changed to affect $Y$. For instance, healthy lifestyles might have been adopted (lower-sugar diet due to enhanced health concern including oral health), and this could have been the actual reason for the lower tooth decay proportion. To refute this possibility, suppose fluoridation stopped (the treatment changing from 1 to 0) and stayed that way for another five years. Suppose the tooth decay proportion increased during this second five-year period. If an unhealthy lifestyle was adopted during this period, then again this might explain the higher tooth decay proportion, but that is unlikely; hence, the reverse treatment corroborates the initial finding. This example is a modified version of actual studies on fluoridation referred to in Gordis (2000, 7–9).
Multiple Responses
There have been claims of beneficial effects of moderate drinking of alcohol—particularly red wine—on heart disease. Since there are potential risks in drinking, it is difficult to do an experiment, and studies on that causal link are observational, where
people self-select their drinking level. Thun et al. (1997) examined a large data set on older U.S. adults with $N=$ 490,000. In 1982, the individuals reported on their drinking habits, and 46,000 died during the nine-year follow-up. In the study, drinking habit was measured separately for beer, wine, and spirits; the sum was then recorded as the total number of drinks per day. It was found that moderate drinking reduces death rates from cardiovascular diseases.
Part of their Table 4 for women is
Deaths (SD) per 100,000 and Number of Drinks per Day

Cause of death               0            Less than 1   1            2–3          4 or more
Cirrhosis, alcoholism        5.0 (0.9)    4.3 (0.9)     7.7 (1.9)    10.4 (1.9)   23.9 (4.5)
Cardiovascular diseases      335 (7.8)    230 (7.5)     213 (10.2)   228 (9.8)    251 (16)
Breast cancer                30.3 (2.1)   33.3 (2.4)    37.6 (4.1)   45.8 (4.2)   29.1 (5.3)
Injuries & external causes   22.7 (1.9)   25.5 (2.2)    17.7 (2.8)   18.9 (2.7)   17.1 (4.0)
Examining the cirrhosis and alcoholism row, the death rate increases as more alcohol is consumed. The death rate for cardiovascular diseases decreases for moderate drinking but increases as the number of drinks goes up. The death rate from breast cancer increases substantially but then it drops for four drinks or more, which casts some doubt on the study. The most problematic is the death rate for injuries and external causes, which is decreasing for one drink or more.
In a randomized study, we would expect drinkers to have more accidents (thus a higher death rate for injuries and external causes), because being drunk makes a person less alert and less careful. That the data show otherwise suggests that drinkers may be systematically different from nondrinkers. Drinkers may be more careful and attentive to their health and lifestyle, and this may be the real reason for the lower cardiovascular disease death rate. Wine drinkers are sometimes reported to have a healthy lifestyle in the United States. This may have to do with the fact that wines are more expensive than beers, and better-educated people with more money drink wines. That is, better education could be the common factor driving wine drinking and healthy lifestyle in the United States. Looking at the extra response variable (death rate due to injuries and external causes), we can see a possible hidden bias due to unobserved confounders such as alertness/carefulness and a healthy lifestyle due to high income and education.
In the drinking example, the extra response variable is expected to be affected by the treatment in a known direction. There are cases where an extra response variable is not supposed to be affected at all. For example, consider the effect of a lower speed limit on the number of traffic accidents. One unobserved confounder is police patrol intensity: it is possible that the police patrol is intensified to enforce the lower speed limit, which then reduces the number of traffic accidents, whereas the real effect of the lower speed limit per se is nil. In this example, an extra response variable can be crime rate not supposed to be affected by speed limit. If crime rate does not change following the speed limit change, then we can rule out the possibility of the intensified
patrol efforts affecting traffic accidents. Of course, the best thing would be to find a variable representing police patrol effort and see if it really changed. When this is not done, the next best thing would be to use the extra response variable (crime rate) to detect changes in police patrol intensity.
Multiple Control Groups
Zero is an intriguing number, and no treatment can mean many different things. With drinking as the treatment, it may mean the real nondrinkers, but it may also mean people who used to drink heavily a long time ago and then stopped for health reasons (ex-drinkers). With job training as the treatment, no treatment can mean people who never applied to the program, but it can also mean people who had applied but were rejected. As the real nondrinkers differ from the ex-drinkers, the nonapplicants differ from the rejected. In the job-training example, there are two control groups: the nonapplicants and the rejected. Both groups did not receive the treatment, but they can differ in terms of unobserved confounders.
It is possible to detect the presence of unobserved confounders using multiple control groups. Let C denote the nonapplicants and ${C}_{r}$ the rejected. Suppose $E(Y|X,C)\ne E(Y|X,{C}_{r})$. This must be due to an unobserved variable $\epsilon $, raising the suspicion that the T group might also be different from C and ${C}_{r}$ in terms of $\epsilon $. Specifically, to ensure the program success, the program administrators may have “cherry-picked” applicants with higher values of $\epsilon $, which can be quality or ability. Then ${C}_{r}$ comprises people with low $\epsilon $. In this example, comparing the C group with the extra control group ${C}_{r}$ helps one see the presence of an unobserved confounder.
Card and Krueger (1994, 2000) analyzed the effect of a minimum wage increase on employment. In 1992, New Jersey increased its minimum wage from $4.25 to $5.05 per hour. From New Jersey and eastern Pennsylvania, 410 fast food restaurants were sampled before and after the minimum wage change (the treatment) for DD. That is, PA fast food restaurants were used as a control group. Not just PA restaurants but also NJ fast food restaurants with a starting wage higher than $5 were used as another control group, because those NJ restaurants were unlikely to be affected by the treatment.
The next table is part of Table 3 in Card and Krueger (1994); it shows the average (SD), where ‘FTE (full-time equivalent)’ is the number of full-time workers plus 0.5 times the number of part-time workers, ‘NJ ($4.25)’ is for the NJ restaurants with the pre-treatment starting wage $4.25 (affected by the treatment), and ‘NJ ($5)’ is for the NJ restaurants with the pre-treatment starting wage $5 or above (little affected by the treatment).
DD with Two Control Groups for Minimum Wage Effect on Employment

             NJ      PA      NJ − PA       NJ ($4.25)   NJ ($5)   NJ ($4.25) − NJ ($5)
FTE before   20.44   23.33   −2.89         19.56        22.25     −2.69
FTE after    21.03   21.17   −0.14         20.88        20.21     0.67
Difference   0.59    −2.16   2.75 (1.36)   1.32         −2.04     3.36 (1.48)
From the last row of the left half, despite the minimum wage increase, NJ FTE increased whereas PA FTE decreased; the DD estimate is significantly positive, showing no negative effect of the minimum wage increase. From the right half using NJ($5) as a second control group, almost the same finding is seen. The second control group renders a coherent story, behaving similarly to the first control group. In this DD, no covariates were controlled for; Card and Krueger (1994) tried many regression models, only to conclude no evidence of employment reduction due to the minimum wage increase.
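The left-half DD entry is just the difference of the two within-group changes; a one-line check from the four means in the table:

```python
# DD estimate reproduced from the group means in the table above
nj_before, nj_after = 20.44, 21.03
pa_before, pa_after = 23.33, 21.17
dd = (nj_after - nj_before) - (pa_after - pa_before)
print(round(dd, 2))   # 2.75, matching the NJ - PA "Difference" entry
```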
A.3.2 IVE and Complier Effect
Two-Stage LSE and IVE
Imagine a health education program where individuals are randomized in ($\delta =1$) or out ($\delta =0$) to be given some education on the health benefits of exercise $D$, and we are interested in the effects of exercise $D$ on health $Y$, not the effects of $\delta $ on $Y$. One concern, however, is that there may be unobserved confounders $\epsilon $ affecting both $D$ and $Y$, making $D$ an endogenous treatment. For instance, $\epsilon $ may be laziness, which affects exercise $D$ and health $Y$. In this case, a simple LSE of $Y$ on $(1,D)$ will not work. One solution for this problem is using the ‘exogenous variation’ in $D$ caused by the randomization dummy $\delta $. Since $\delta $ does not affect $Y$ directly, ‘$\delta \to D\to Y$’ holds; $\delta $ affects $Y$ only indirectly through $D$. The instrumental variable estimator (IVE) can be applied with $\delta $ as an instrumental variable (IV) for $D$.
To formalize the idea, suppose, for mean-zero errors $\epsilon $ and $U$,
$${D}_{i}={\alpha}_{0}+{\alpha}_{\delta}{\delta}_{i}+{\epsilon}_{i}\text{with}{\alpha}_{\delta}\ne 0\text{and}{Y}_{i}={\beta}_{0}+{\beta}_{d}{D}_{i}+{U}_{i},$$
where
$COR(D,U)\ne 0$ due to laziness lurking in
$\epsilon $ and
$U$, but
$COR(\delta ,U)=0$. The interest is in ${\beta}_{d}$, the slope of the endogenous treatment $D$. The assumption
${\alpha}_{\delta}\ne 0$ is critical for
$\delta $ to give an exogenous variation to
$D$; in the foregoing example, the education on health benefits of exercise should make at least some people exercise. Doing the LSE of
$D$ on
$(1,\delta )$ to get
$({\hat{\alpha}}_{0},{\hat{\alpha}}_{\delta})$ and then doing the LSE of
$Y$ on
$(1,\hat{D})$ where
$\hat{D}\equiv {\hat{\alpha}}_{0}+{\hat{\alpha}}_{\delta}\delta $, we can estimate
${\beta}_{d}$ consistently. This is the well-known two-stage LSE, which equals the IVE below.
Rewrite the $D$ and $Y$ structural form (SF) equations as
$$\begin{array}{rcl}& & {D}_{i}={G}_{i}^{\mathrm{\prime}}\alpha +{\epsilon}_{i}\text{}\text{}\text{}\text{}\text{}\text{and}{Y}_{i}={W}_{i}^{\mathrm{\prime}}\beta +{U}_{i},\\ & & \phantom{\rule{1em}{0ex}}\text{where}{G}_{i}\equiv (1,{\delta}_{i}{)}^{\mathrm{\prime}},\text{}\text{}\text{}{W}_{i}\equiv (1,{D}_{i}{)}^{\mathrm{\prime}},\text{}\text{}\text{}\text{}\text{}\alpha \equiv ({\alpha}_{0},{\alpha}_{\delta}{)}^{\mathrm{\prime}},\text{}\text{}\text{}\beta \equiv ({\beta}_{0},{\beta}_{d}{)}^{\mathrm{\prime}}.\end{array}$$
With ${E}^{-1}(\cdot )$ denoting $\{E(\cdot ){\}}^{-1}$, the IVE that is consistent for $\beta $ is
$$\begin{array}{rcl}& & {\left(\frac{1}{N}\sum _{i}{G}_{i}{W}_{i}^{\mathrm{\prime}}\right)}^{-1}\frac{1}{N}\sum _{i}{G}_{i}{Y}_{i}\stackrel{p}{\longrightarrow }{E}^{-1}(G{W}^{\mathrm{\prime}})\cdot E(GY)\\ & & \phantom{\rule{1em}{0ex}}={E}^{-1}(G{W}^{\mathrm{\prime}})\cdot E\{G({W}^{\mathrm{\prime}}\beta +U)\}=\beta +{E}^{-1}(G{W}^{\mathrm{\prime}})E(GU)=\beta \text{ as }E(GU)=0.\end{array}$$
Here,
$\delta $ is an IV for
$D$; we also say that
$G$ is an instrument (vector) for
$W$. With no covariates
$X$, call the IVE ‘simple IVE’.
As can be seen in the last equation, an IV $\delta $ has to meet two necessary conditions:
$$COR(\delta ,D)\ne 0\text{ so that }{E}^{-1}(G{W}^{\mathrm{\prime}})\text{ exists}\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}COR(\delta ,U)=0\text{ so that }E(GU)=0;$$
the former is the ‘inclusion restriction’ ${\alpha}_{\delta}\ne 0$—the education should be effective in inducing exercise—and the latter holds due to the randomization of $\delta $. An additional requirement is that $\delta $ not enter the $Y$ equation directly, which is an ‘exclusion restriction’—$\delta $ can influence $Y$ only indirectly through $D$. In short, an IV should meet three conditions: inclusion restriction, exclusion restriction, and zero correlation with the model error term.
Substituting the $D$ SF into the $Y$ SF, we get the $Y$ reducedform (RF) equation:
$$Y={\beta}_{0}+{\beta}_{d}({\alpha}_{0}+{\alpha}_{\delta}\delta +\epsilon )+U=({\beta}_{0}+{\alpha}_{0}{\beta}_{d})+{\alpha}_{\delta}{\beta}_{d}\delta +(U+{\beta}_{d}\epsilon ).$$
This shows that if we are interested only in testing ‘${H}_{0}:{\beta}_{d}=0$’, then we can test whether the slope of $\delta $ is zero in the LSE of $Y$ on $(1,\delta )$ because ${\alpha}_{\delta}\ne 0$—this LSE works because $\delta $ is exogenous. The
$Y$ RF shows in fact more: the slope of
$\delta $ is
$${\gamma}_{\delta}\equiv {\alpha}_{\delta}{\beta}_{d},$$
which is the product of the two effects:
${\alpha}_{\delta}$ for
$\delta $ on
$D$, and
${\beta}_{d}$ for
$D$ on
$Y$. This shows yet another way of finding
${\beta}_{d}$: do the LSE of
$D$ on
$(1,\delta )$ to get
${\hat{\alpha}}_{\delta}$ and
$Y$ on
$(1,\delta )$ to get
${\hat{\gamma}}_{\delta}$, and finally the ratio
${\hat{\gamma}}_{\delta}/{\hat{\alpha}}_{\delta}$ for
${\beta}_{d}$.
So far, we introduced three ways of finding ${\beta}_{d}$: twostage LSE, simple IVE, and the ratio ${\hat{\gamma}}_{\delta}/{\hat{\alpha}}_{\delta}$. Although they may look different, they are numerically the same. The fact that twostage LSE and simple IVE are the same is well known (see, e.g., Lee 2010a), and the equivalence of the simple IVE slope and the ratio ${\hat{\gamma}}_{\delta}/{\hat{\alpha}}_{\delta}$ will be seen shortly under the name ‘Wald estimator’.
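The numerical equivalence can be illustrated with a small simulation; the data-generating numbers below (e.g., ${\beta}_{d}=1.5$, ${\alpha}_{\delta}=0.5$) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
delta = rng.integers(0, 2, N).astype(float)     # randomized binary instrument
eps = rng.normal(size=N)
U = 0.8 * eps + rng.normal(size=N)              # COR(D, U) != 0 through eps
D = 1.0 + 0.5 * delta + eps                     # alpha_delta = 0.5
Y = 2.0 + 1.5 * D + U                           # beta_d = 1.5 (true value)

G = np.column_stack([np.ones(N), delta])        # instruments (1, delta)
W = np.column_stack([np.ones(N), D])            # regressors (1, D)

# (1) simple IVE: (sum_i G_i W_i')^{-1} sum_i G_i Y_i
b_ive = np.linalg.solve(G.T @ W, G.T @ Y)

# (2) two-stage LSE: LSE of D on (1, delta), then LSE of Y on (1, D_hat)
a_hat = np.linalg.lstsq(G, D, rcond=None)[0]
D_hat = G @ a_hat
b_2sls = np.linalg.lstsq(np.column_stack([np.ones(N), D_hat]), Y, rcond=None)[0]

# (3) ratio of reduced-form slopes: gamma_hat_delta / alpha_hat_delta
g_hat = np.linalg.lstsq(G, Y, rcond=None)[0]
b_ratio = g_hat[1] / a_hat[1]

print(b_ive[1], b_2sls[1], b_ratio)   # identical up to floating-point error
```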
Suppose covariates $X$ with $COR(X,U)=0$ appear in the $D$ and $Y$ equations as in
$${D}_{i}={\alpha}_{0}+{\alpha}_{\delta}{\delta}_{i}+{X}_{i}^{\mathrm{\prime}}{\alpha}_{x}+{\epsilon}_{i}\text{and}{Y}_{i}={\beta}_{0}+{\beta}_{d}{D}_{i}+{X}_{i}^{\mathrm{\prime}}{\beta}_{x}+{U}_{i}.$$
Under
$COR(\delta ,D|X)\ne 0$ and $COR(\delta ,U|X)=0$, IVE takes the same form as the simple IVE, but with
$${G}_{i}\equiv (1,{\delta}_{i},{X}_{i}^{\mathrm{\prime}}{)}^{\mathrm{\prime}},\text{}\text{}\text{}{W}_{i}\equiv (1,{D}_{i},{X}_{i}^{\mathrm{\prime}}{)}^{\mathrm{\prime}},\text{}\alpha \equiv ({\alpha}_{0},{\alpha}_{\delta},{\alpha}_{x}^{\mathrm{\prime}}{)}^{\mathrm{\prime}},\text{}\beta =({\beta}_{0},{\beta}_{d},{\beta}_{x}^{\mathrm{\prime}}{)}^{\mathrm{\prime}}.$$
When there are more instruments than necessary so that
$G$ has more elements than
$W$, a generalized version of IVE or ‘generalized method of moments’ (GMM) can be applied. More than enough IVs appear naturally when a conditional moment condition such as
$E(U|Z)=0$ is available because functions of
$Z$ are candidate IVs.
Wald Estimator
In the LSE of $Y$ on $(1,D)$, the slope estimator equals the sample mean difference of the two groups $D=0,1$, which was proven in Chapter 1. There is an analogous relation as follows between the ‘simple IVE with a binary instrument $\delta $ for a binary regressor $D$’ and the ratio of the group mean differences $E(Y|\delta =1)-E(Y|\delta =0)$ and $E(D|\delta =1)-E(D|\delta =0)$.
Recall the simple IVE whose slope is consistent for
$$\frac{COV(Y,\delta )}{COV(D,\delta )}=\frac{E(Y\delta )-E(Y)E(\delta )}{E(D\delta )-E(D)E(\delta )};$$
if
$D=\delta $, this equals the slope parameter
$COV(Y,\delta )/V(\delta )$ of the LSE of
$Y$ on
$(1,\delta )$. Observe
$$E(D\delta )=E(D|\delta =1)P(\delta =1)\phantom{\rule{1em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}D=D(\delta +1-\delta ).$$
Rewrite the denominator
$E(D\delta )-E(D)E(\delta )$ in the preceding display as
$$\begin{array}{rcl}& & E(D|\delta =1)P(\delta =1)-E\{D(\delta +1-\delta )\}\cdot P(\delta =1)\\ & & \phantom{\rule{1em}{0ex}}=E(D|\delta =1)P(\delta =1)-E(D\delta )P(\delta =1)-E\{D(1-\delta )\}P(\delta =1)\\ & & \phantom{\rule{1em}{0ex}}=E(D|\delta =1)P(\delta =1)-E(D|\delta =1)P(\delta =1{)}^{2}-E(D|\delta =0)P(\delta =0)P(\delta =1)\\ & & \phantom{\rule{1em}{0ex}}=E(D|\delta =1)P(\delta =0)P(\delta =1)-E(D|\delta =0)P(\delta =0)P(\delta =1).\end{array}$$
Analogously, the numerator
$E(Y\delta )-E(Y)E(\delta )$ equals
$$E(Y|\delta =1)P(\delta =0)P(\delta =1)-E(Y|\delta =0)P(\delta =0)P(\delta =1).$$
Canceling
$P(\delta =0)P(\delta =1)$ that appears in both the denominator and numerator gives
$$\frac{COV(Y,\delta )}{COV(D,\delta )}=\frac{E(Y|\delta =1)-E(Y|\delta =0)}{E(D|\delta =1)-E(D|\delta =0)}.$$
The sample version for the last ratio of the group mean differences is the Wald estimator:
$$\left(\frac{\sum _{i}{Y}_{i}{\delta}_{i}}{\sum _{i}{\delta}_{i}}-\frac{\sum _{i}{Y}_{i}(1-{\delta}_{i})}{\sum _{i}(1-{\delta}_{i})}\right)\cdot {\left(\frac{\sum _{i}{D}_{i}{\delta}_{i}}{\sum _{i}{\delta}_{i}}-\frac{\sum _{i}{D}_{i}(1-{\delta}_{i})}{\sum _{i}(1-{\delta}_{i})}\right)}^{-1}.$$
In the causal route
$\delta \to D\to Y$, the numerator of the Wald estimator is for the multiplicative indirect effect
${\alpha}_{\delta}{\beta}_{d}$ of
$\delta $ on
$Y$, and the denominator is for the effect
${\alpha}_{\delta}$ of
$\delta $ on
$D$; by the division, the direct effect
${\beta}_{d}$ of
$D$ on
$Y$ is recovered. This is the aforementioned equivalence of simple IVE to the LSEbased ratio
${\hat{\gamma}}_{\delta}/{\hat{\alpha}}_{\delta}$.
In a clinical trial where $\delta $ is a random assignment, $D=\delta $ means “compliance” and $D\ne \delta $ means “noncompliance”; $E(Y|\delta =1)-E(Y|\delta =0)$ is called the ‘intent-to-treat effect’, because it shows the effect of the treatment intention (i.e., assignment), not of the actual treatment received. Noncompliance with the treatment dilutes the true effect, and the Wald estimator blows up the diluted effect with the factor $\{E(D|\delta =1)-E(D|\delta =0){\}}^{-1}$. This is the ‘rescaling’ role of the Wald estimator denominator.
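In sample terms, the Wald estimator is the intent-to-treat mean difference rescaled by the take-up difference; a minimal sketch (the helper name is ours), with a check against the covariance-ratio form derived above:

```python
import numpy as np

def wald_estimator(Y, D, delta):
    """Intent-to-treat effect rescaled by the compliance (take-up) difference."""
    itt_y = Y[delta == 1].mean() - Y[delta == 0].mean()   # assignment effect on Y
    itt_d = D[delta == 1].mean() - D[delta == 0].mean()   # assignment effect on D
    return itt_y / itt_d
```

On any sample with binary $\delta $, this ratio coincides with $\widehat{COV}(Y,\delta )/\widehat{COV}(D,\delta )$, the simple IVE slope.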
So far, a constant treatment effect has been assumed, the same for all individuals. If the treatment effect is heterogeneous, varying across individuals, then IVE can be inconsistent. To see this, recall $Y={Y}^{0}+({Y}^{1}-{Y}^{0})D$ and suppose that the individual effect ${Y}^{1}-{Y}^{0}$ is not a constant but ${Y}_{i}^{1}-{Y}_{i}^{0}={\beta}_{d}+{V}_{i}$ with $E(V)=0$.
Then
$$\begin{array}{rcl}& & Y={Y}^{0}+({\beta}_{d}+V)D={\beta}_{d}D+({Y}^{0}+VD)\\ & & \phantom{\rule{1em}{0ex}}=E({Y}^{0}+VD)+{\beta}_{d}D+\{{Y}^{0}+VD-E({Y}^{0}+VD)\},\end{array}$$
where
$E({Y}^{0}+VD)$ is the intercept and the term in
$\{\cdot \}$ is the error. The trouble is
$VD$ in the error term, because the instrument
$\delta $ may be related to
$VD$ as
$COR(\delta ,D)\ne 0$; if
$V⨿(D,\delta )$, then IVE is consistent because $E(\delta VD)=E(V)E(\delta D)=0$. Since $V$ is part of the treatment effect ${Y}^{1}-{Y}^{0}$, ‘$V⨿(D,\delta )$’ would be questionable at best. Despite this problem due to heterogeneous effects, IVE is still consistent for an interesting parameter, as is shown next.
Wald Estimator for Effect on Compliers
Since $\delta $ affects $D$, we can imagine potential treatments $({D}^{0},{D}^{1})$ depending on $\delta =0,1$, analogously to the potential responses $({Y}^{0},{Y}^{1})$ depending on $D=0,1$. We only observe $(\delta ,D,Y)$, although $(\delta ,{D}^{0},{D}^{1},{Y}^{0},{Y}^{1})$ is considered. Classify the individuals with $({D}^{0},{D}^{1})$, following Imbens and Angrist (1994) and Angrist et al. (1996):
$$\begin{array}{rcl}& & {D}^{0}=0,{D}^{1}=0:\text{ never takers (of the treatment, no matter what }\delta \text{ is);}\\ & & {D}^{0}=0,{D}^{1}=1:\text{ compliers (taking treatment only when }\delta =1\text{);}\\ & & {D}^{0}=1,{D}^{1}=0:\text{ defiers (taking treatment only when }\delta =0\text{);}\\ & & {D}^{0}=1,{D}^{1}=1:\text{ always takers (no matter what }\delta \text{ is).}\end{array}$$
For the exercise ($D$) example, never takers never exercise regardless of the education $\delta $ on benefits of exercise; compliers exercise only when educated ($\delta =1$); defiers exercise only when not educated ($\delta =0$); always takers always exercise regardless of $\delta $. We should observe both ${D}^{0}$ and ${D}^{1}$ to know the type of a person, but only one of ${D}^{0}$ and ${D}^{1}$ is observed; hence, the type is unknown. Since the grouping based on $({D}^{0},{D}^{1})$ is not affected by $\delta $, this is a ‘principal stratification’ (Frangakis and Rubin 2002). In contrast, the membership for the $D=0$ or $D=1$ group changes as $\delta $ changes, so long as $\delta $ affects $D$. For instance, the compliers belong to $D=0$ (along with the never takers) when $\delta =0$, but they belong to $D=1$ (along with the always takers) when $\delta =1$.
Suppose the following three conditions hold: (a) $E(D|\delta =1)\ne E(D|\delta =0)$, so that $\delta $ affects $D$; (b) $\delta ⨿({Y}^{0},{Y}^{1},{D}^{0},{D}^{1})$; (c) either ${D}^{0}\le {D}^{1}$ for all individuals, or ${D}^{0}\ge {D}^{1}$ for all individuals.
Condition (a) is the inclusion restriction that $\delta $ is in the $D$ equation to affect $D$. Condition (b) amounts to the exclusion restriction that $\delta $ is not in the ${Y}^{0}$ and ${Y}^{1}$ equations. Condition (c) is a monotonicity assumption.
One example in which the three conditions hold is
$${Y}^{d}={\beta}_{d}+{U}^{d},\text{ }d=0,1,\phantom{\rule{1em}{0ex}}D=1[{\alpha}_{0}+{\alpha}_{\delta}\delta +\epsilon >0],\phantom{\rule{1em}{0ex}}{\alpha}_{\delta}\ne 0,\phantom{\rule{1em}{0ex}}\delta ⨿(\epsilon ,{U}^{0},{U}^{1}).$$
Here,
${Y}^{1}-{Y}^{0}={\beta}_{1}-{\beta}_{0}+V$ with $V\equiv {U}^{1}-{U}^{0}$: the effect varies across individuals. Condition (a) holds due to
${\alpha}_{\delta}\ne 0$. Condition (b) holds because
$\delta $ is independent of
$(\epsilon ,{U}^{0},{U}^{1})$ and
$$({Y}^{0},{Y}^{1},{D}^{0},{D}^{1})=({\beta}_{0}+{U}^{0},\text{}{\beta}_{1}+{U}^{1},\text{}1[{\alpha}_{0}+\epsilon >0],\text{}1[{\alpha}_{0}+{\alpha}_{\delta}+\epsilon >0]).$$
Condition (c) holds with
$\le $ or
$\ge $ depending on
${\alpha}_{\delta}\gtrless 0$. We can allow
${\alpha}_{\delta}$ to vary across individuals (say,
${\alpha}_{\delta i}$) without disturbing the above conditions, as long as all
${\alpha}_{\delta i}$‘s take the same sign. Without loss of generality,
assume ${D}^{0}\le {D}^{1}$ to rule out defiers from now on.
Observe
$$\begin{array}{rcl}& & E(Y|\delta =1)-E(Y|\delta =0)=E\{D{Y}^{1}+(1-D){Y}^{0}|\delta =1\}\\ & & \phantom{\rule{2em}{0ex}}-E\{D{Y}^{1}+(1-D){Y}^{0}|\delta =0\}\\ & & \phantom{\rule{1em}{0ex}}=E\{{D}^{1}{Y}^{1}+(1-{D}^{1}){Y}^{0}|\delta =1\}-E\{{D}^{0}{Y}^{1}+(1-{D}^{0}){Y}^{0}|\delta =0\}\\ & & \phantom{\rule{2em}{0ex}}\text{(since }D={D}^{1}\text{ when }\delta =1\text{ and }D={D}^{0}\text{ when }\delta =0\text{)}\\ & & \phantom{\rule{1em}{0ex}}=E\{{D}^{1}{Y}^{1}+(1-{D}^{1}){Y}^{0}\}-E\{{D}^{0}{Y}^{1}+(1-{D}^{0}){Y}^{0}\}\phantom{\rule{1em}{0ex}}\text{(due to (b))}\\ & & \phantom{\rule{1em}{0ex}}=E\{({D}^{1}-{D}^{0})({Y}^{1}-{Y}^{0})\}\\ & & \phantom{\rule{1em}{0ex}}=E({Y}^{1}-{Y}^{0}|{D}^{1}-{D}^{0}=1)P({D}^{1}-{D}^{0}=1)\\ & & \phantom{\rule{2em}{0ex}}\text{(no defier implies }P({D}^{1}-{D}^{0}=-1)=0\text{).}\end{array}$$
Since ${D}^{1}-{D}^{0}=1\Longleftrightarrow {D}^{1}=1,{D}^{0}=0$ (complier), dividing the first and last expressions by $P({D}^{1}-{D}^{0}=1)$ gives the effect on the compliers
$$\begin{array}{rcl}E({Y}^{1}-{Y}^{0}|{D}^{1}=1,{D}^{0}=0)&=&\frac{E(Y|\delta =1)-E(Y|\delta =0)}{P({D}^{1}=1,{D}^{0}=0)}\\ &=&\frac{E(Y|\delta =1)-E(Y|\delta =0)}{E(D|\delta =1)-E(D|\delta =0)},\end{array}$$
which is the Wald estimator probability limit; the last equality holds because
$$\begin{array}{rcl}& & E(D|\delta =1)-E(D|\delta =0)=P(D=1|\delta =1)-P(D=1|\delta =0)\\ & & \phantom{\rule{1em}{0ex}}=P(\text{always taker or complier})-P(\text{always taker})=P(\text{complier}).\end{array}$$
The effect on compliers is also called the ‘local average treatment effect’ (LATE) (Imbens and Angrist 1994). The qualifier ‘local’ refers to the fact that LATE is specific to the instrument in use. Bear in mind that the LATE interpretation of the simple IVE (i.e., Wald estimator) requires the above three conditions, and that LATE can change as the instrument in use changes. If IVE changes as the instrument changes, then this is an indication for heterogeneous treatment effects (or some instruments may be invalid). IVE estimating the effect on those who change their behavior as the instrument value changes looks natural.
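A simulation sketch (strata shares and per-stratum effects below are made up) illustrating that the Wald ratio recovers the complier effect rather than the overall average effect:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000

# principal strata (shares are made up): never takers, compliers, always takers
types = rng.choice(["never", "complier", "always"], size=N, p=[0.3, 0.5, 0.2])
D0 = (types == "always").astype(float)            # D^0
D1 = (types != "never").astype(float)             # D^1; D^0 <= D^1: no defiers
delta = rng.integers(0, 2, N)                     # randomized instrument
D = np.where(delta == 1, D1, D0)

# heterogeneous effects: Y^1 - Y^0 differs by stratum (complier effect = 2.0)
effect = np.select([types == "never", types == "complier", types == "always"],
                   [0.5, 2.0, -1.0])
Y0 = rng.normal(size=N)
Y = np.where(D == 1, Y0 + effect, Y0)

wald = (Y[delta == 1].mean() - Y[delta == 0].mean()) / \
       (D[delta == 1].mean() - D[delta == 0].mean())
print(wald)   # should be close to 2.0, the complier (LATE) effect
```

Note that the denominator estimates $P(\text{complier})$, here 0.5, exactly as in the display above.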
Abadie (2003) showed that the effect on the compliers can be written also as
$$\frac{E(YD|\delta =1)-E(YD|\delta =0)}{E(D|\delta =1)-E(D|\delta =0)}-\frac{E\{Y(1-D)|\delta =1\}-E\{Y(1-D)|\delta =0\}}{E(1-D|\delta =1)-E(1-D|\delta =0)}.$$
(p.230)
The proof is similar to that for $E\{(D^{1}-D^{0})(Y^{1}-Y^{0})\}$. For the first term, observe
$$\begin{aligned}
&E(YD\mid\delta=1)-E(YD\mid\delta=0)=E(Y^{1}D^{1}\mid\delta=1)-E(Y^{1}D^{0}\mid\delta=0)\\
&\quad =E(Y^{1}D^{1})-E(Y^{1}D^{0})\\
&\quad =E\{Y^{1}(D^{1}-D^{0})\}=E(Y^{1}\mid D^{1}-D^{0}=1)P(D^{1}-D^{0}=1);
\end{aligned}$$
$E(Y^{1}D^{0})$, not $E(Y^{0}D^{0})$, appears because $D^{0}=1$ means treated. Divide the first and last expressions by $E(D\mid\delta=1)-E(D\mid\delta=0)$ to obtain
$$E(Y^{1}\mid\text{complier})=\frac{E(YD\mid\delta=1)-E(YD\mid\delta=0)}{E(D\mid\delta=1)-E(D\mid\delta=0)}.$$
Analogously, the second term holds due to the following:
$$\begin{aligned}
&E\{Y(1-D)\mid\delta=1\}-E\{Y(1-D)\mid\delta=0\}\\
&\quad =E\{Y^{0}(1-D^{1})\mid\delta=1\}-E\{Y^{0}(1-D^{0})\mid\delta=0\}\\
&\quad =E\{Y^{0}(1-D^{1})\}-E\{Y^{0}(1-D^{0})\}\\
&\quad =-E\{Y^{0}(D^{1}-D^{0})\}=-E(Y^{0}\mid D^{1}-D^{0}=1)P(D^{1}-D^{0}=1);\\
&E(1-D\mid\delta=1)-E(1-D\mid\delta=0)=E(1-D^{1})-E(1-D^{0})=-P(\text{complier}).
\end{aligned}$$
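Continuing the simulation idea, the two complier means can be estimated directly from these ratios. The sketch below (an assumed DGP of ours, with $E(Y^{1})=2$ and $E(Y^{0})=0.5$ for everyone so the complier means are known) computes both terms of Abadie's expression:

```python
import numpy as np

# Illustrative simulation (not from the text): estimate E(Y^1|complier)=2 and
# E(Y^0|complier)=0.5 via the two ratios in Abadie's (2003) expression.
rng = np.random.default_rng(1)
N = 400_000
typ = rng.choice([0, 1, 2], size=N, p=[0.3, 0.5, 0.2])   # never/complier/always
delta = rng.integers(0, 2, size=N)
D = np.where(typ == 2, 1, np.where(typ == 1, delta, 0))
Y = np.where(D == 1, 2.0 + rng.normal(size=N), 0.5 + rng.normal(size=N))

def cmean(v, g):                          # sample analog of E(v | delta=g)
    return v[delta == g].mean()

ey1 = (cmean(Y * D, 1) - cmean(Y * D, 0)) / (cmean(D, 1) - cmean(D, 0))
ey0 = (cmean(Y * (1 - D), 1) - cmean(Y * (1 - D), 0)) / \
      (cmean(1 - D, 1) - cmean(1 - D, 0))
late = ey1 - ey0                          # effect on the compliers
```

Here `ey1 - ey0` reproduces the Wald/LATE difference, while `ey1` and `ey0` separately identify the complier potential-outcome means.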
A.3.3 Selection Correction Approach
Other than IVE, there are a number of ways to deal with unobserved confounders: sensitivity analysis, the bounding method, and the selection correction approach. Here we examine only the selection correction approach for binary $D$, eschewing sensitivity analysis and the bounding method, which are not popular in practice. For sensitivity analysis, interested readers can refer to Lee (2004), Altonji et al. (2005), Ichino et al. (2008), Lee and Lee (2009), Rosenbaum (2010), Huber (2014), and references therein. As for the bounding method, see Manski (2003), Tamer (2010), Choi and Lee (2012), Nevo and Rosen (2012), Chernozhukov et al. (2013), and references therein.
With $W$ denoting covariates including $X$, suppose
$$\begin{aligned}
D_{i}&=1[0<W_{i}'\alpha+\epsilon_{i}],\quad Y_{i}^{d}=X_{i}'\beta_{d}+U_{i}^{d},\ d=0,1,\quad W_{i}=(C_{i}',X_{i}')'\\
\epsilon&\sim N(0,\sigma_{\epsilon}^{2})\amalg W,\quad E(U^{d}\mid W,\epsilon)=\frac{\sigma_{\epsilon d}}{\sigma_{\epsilon}^{2}}\epsilon,\quad \sigma_{\epsilon d}\equiv E(\epsilon U^{d}),\\
\rho_{\epsilon d}&\equiv COR(U^{d},\epsilon),\quad \sigma_{d}^{2}\equiv V(U^{d}).
\end{aligned}$$
This model includes the exclusion restriction that
$C$ is excluded from the
$Y$ equation, which is not necessary, but helpful; see, for example,
Lee (2010a). From the model, we obtain
$$\begin{aligned}
\tau&\equiv E(Y^{1}-Y^{0})=E(X')\cdot(\beta_{1}-\beta_{0}),\\
\tau_{d}&\equiv E(Y^{1}-Y^{0}\mid D=d)=E(X'\mid D=d)(\beta_{1}-\beta_{0})+E(U^{1}\mid D=d)-E(U^{0}\mid D=d).
\end{aligned}$$
$\tau$ needs only $\beta_{1}-\beta_{0}$, while $\tau_{1}$ and $\tau_{0}$ additionally need $E(U^{1}\mid D=d)-E(U^{0}\mid D=d)$.
(p.231)
It holds that
$$\begin{aligned}
E(U^{d}\mid W,D=1)&=E(U^{d}\mid W,\ \epsilon>-W'\alpha)=\frac{\sigma_{\epsilon d}}{\sigma_{\epsilon}^{2}}\cdot E(\epsilon\mid W,\ \epsilon>-W'\alpha)\\
&=\frac{\sigma_{\epsilon d}}{\sigma_{\epsilon}}\cdot E\Big(\frac{\epsilon}{\sigma_{\epsilon}}\,\Big|\,W,\ \frac{\epsilon}{\sigma_{\epsilon}}>-\frac{W'\alpha}{\sigma_{\epsilon}}\Big)=\frac{\sigma_{\epsilon d}}{\sigma_{\epsilon}}\frac{\varphi(W'\alpha/\sigma_{\epsilon})}{1-\Phi(-W'\alpha/\sigma_{\epsilon})}\\
&=\rho_{\epsilon d}\sigma_{d}\frac{\varphi(W'\alpha/\sigma_{\epsilon})}{\Phi(W'\alpha/\sigma_{\epsilon})};\\
E(U^{d}\mid W,D=0)&=-\frac{\sigma_{\epsilon d}}{\sigma_{\epsilon}}\frac{\varphi(W'\alpha/\sigma_{\epsilon})}{\Phi(-W'\alpha/\sigma_{\epsilon})}=-\rho_{\epsilon d}\sigma_{d}\frac{\varphi(W'\alpha/\sigma_{\epsilon})}{\Phi(-W'\alpha/\sigma_{\epsilon})}.
\end{aligned}$$
From this,
$$\begin{aligned}
\tau_{1}&=E(X'\mid D=1)(\beta_{1}-\beta_{0})+E\Big\{\frac{\varphi(W'\alpha/\sigma_{\epsilon})}{\Phi(W'\alpha/\sigma_{\epsilon})}\,\Big|\,D=1\Big\}\cdot(\rho_{\epsilon1}\sigma_{1}-\rho_{\epsilon0}\sigma_{0}),\\
\tau_{0}&=E(X'\mid D=0)(\beta_{1}-\beta_{0})+E\Big\{\frac{\varphi(W'\alpha/\sigma_{\epsilon})}{\Phi(-W'\alpha/\sigma_{\epsilon})}\,\Big|\,D=0\Big\}\cdot(-\rho_{\epsilon1}\sigma_{1}+\rho_{\epsilon0}\sigma_{0}).
\end{aligned}$$
The parameters can be estimated with the ‘Heckman two-stage estimator’ (Heckman 1979) applied separately to the T and C groups. First, $\alpha/\sigma_{\epsilon}$ is estimated by the probit estimate $\hat{\alpha}$, and then
$$\begin{aligned}
&\text{LSE of }(1-D)Y\text{ on }\Big\{(1-D)X,\ (1-D)\frac{\varphi(W'\hat{\alpha})}{\Phi(-W'\hat{\alpha})}\Big\}\\
&\quad\text{and LSE of }DY\text{ on }\Big\{DX,\ D\frac{\varphi(W'\hat{\alpha})}{\Phi(W'\hat{\alpha})}\Big\}
\end{aligned}$$
yield estimates, respectively, for
$$\gamma_{0}\equiv(\beta_{0}',\ -\rho_{\epsilon0}\sigma_{0})'\quad\text{and}\quad\gamma_{1}\equiv(\beta_{1}',\ \rho_{\epsilon1}\sigma_{1})'.$$
Let
${\hat{\gamma}}_{d}$ denote the LSE for
${\gamma}_{d}$, and let the asymptotic variance for
$\sqrt{N}({\hat{\gamma}}_{d}{\gamma}_{d})$ be
${C}_{d}$ with
${\hat{C}}_{d}{\to}^{p}{C}_{d}$. Stack the two estimates and parameters:
$\hat{\gamma}\equiv ({\hat{\gamma}}_{0}^{\mathrm{\prime}},{\hat{\gamma}}_{1}^{\mathrm{\prime}}{)}^{\mathrm{\prime}}$ and
$\gamma \equiv ({\gamma}_{0}^{\mathrm{\prime}},{\gamma}_{1}^{\mathrm{\prime}}{)}^{\mathrm{\prime}}$. Then the asymptotic variance of
$\sqrt{N}(\hat{\gamma}\gamma )$ is
$C\equiv diag({C}_{0},{C}_{1})$ and
$\hat{C}\equiv diag(\hat{C}_{0},\hat{C}_{1})\to^{p}C$. The online appendix has a program ‘SelCorcWorkOnVisit’ for the empirical example below, and the program shows how to obtain $\hat{C}$ with the first-stage probit error taken into account, drawing on Lee (2010a); if getting $\hat{C}$ looks hard, use nonparametric bootstrap.
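The two stages can be sketched on simulated data as follows. All parameter values here are illustrative assumptions of ours, not from the text; the probit is fitted by Fisher scoring to keep the code self-contained, and the group-wise LSEs are run on the subsamples, which is numerically equivalent to the $D$-multiplied regressions above. Note that the Mills-ratio coefficient in the control-group regression estimates $-\rho_{\epsilon0}\sigma_{0}$.

```python
import numpy as np
from math import erf, pi, sqrt

# Sketch of the two-stage estimator (assumed DGP; sigma_eps = 1 so that
# alpha/sigma_eps = alpha).
rng = np.random.default_rng(2)
N = 100_000
Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))  # normal df
phi = lambda z: np.exp(-0.5 * z**2) / sqrt(2.0 * pi)            # normal density

W = np.column_stack([np.ones(N), rng.normal(size=N)])   # W = X = (1, x)'
alpha = np.array([0.2, 1.0])
eps = rng.normal(size=N)
D = (W @ alpha + eps > 0).astype(float)
U1 = 0.8 * eps + rng.normal(scale=0.5, size=N)          # rho_e1*sigma_1 = 0.8
U0 = 0.5 * eps + rng.normal(scale=0.5, size=N)          # rho_e0*sigma_0 = 0.5
Y = np.where(D == 1, W @ [1.0, 2.0] + U1, W @ [0.5, 1.0] + U0)

# Stage 1: probit for alpha/sigma_eps by Fisher scoring
a = np.zeros(2)
for _ in range(15):
    p = np.clip(Phi(W @ a), 1e-10, 1 - 1e-10)
    f = phi(W @ a)
    score = W.T @ ((D - p) * f / (p * (1 - p)))
    info = (W * (f**2 / (p * (1 - p)))[:, None]).T @ W
    a = a + np.linalg.solve(info, score)

# Stage 2: group-wise LSE with the inverse-Mills-ratio regressor
i1, i0 = D == 1, D == 0
Z1 = np.column_stack([W[i1], phi(W[i1] @ a) / Phi(W[i1] @ a)])
Z0 = np.column_stack([W[i0], phi(W[i0] @ a) / Phi(-(W[i0] @ a))])
g1 = np.linalg.lstsq(Z1, Y[i1], rcond=None)[0]  # (beta_1', rho_e1*sigma_1)'
g0 = np.linalg.lstsq(Z0, Y[i0], rcond=None)[0]  # (beta_0', -rho_e0*sigma_0)'
```

`g1` and `g0` are the sketch's versions of $\hat{\gamma}_{1}$ and $\hat{\gamma}_{0}$; their naive LSE standard errors would ignore the first-stage probit error, which is why the text recommends the corrected $\hat{C}$ or bootstrap.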
Suppose $X$ has the dimension $k\times 1$. Define
$$\underset{k\times(2k+2)}{Q}\equiv(-I_{k},0_{k},I_{k},0_{k}),\quad\text{where }0_{k}\text{ is the }k\times1\text{ null vector.}$$
An estimator for
$\tau=E(X')(\beta_{1}-\beta_{0})$ and its asymptotic variance estimator are, respectively,
$$\bar{X}'\cdot Q\hat{\gamma}\quad\text{and}\quad\frac{1}{N}\bar{X}'\cdot Q\hat{C}Q'\cdot\bar{X}.$$
(p.232)
Let ${\overline{X}}_{d}$ be the sample mean of $X$ for the subsample $D=d$, and
$$\underset{(k+1)\times1}{\bar{Z}_{d}}\equiv\Big[\bar{X}_{d}',\ \frac{1}{N_{d}}\sum_{i\in\{D=d\}}\frac{\varphi\{(2d-1)\cdot W_{i}'\hat{\alpha}\}}{\Phi\{(2d-1)\cdot W_{i}'\hat{\alpha}\}}\Big]'.$$
An estimator for
${\tau}_{d}$ and its asymptotic variance estimator are
$$\bar{Z}_{d}'\cdot Q_{d}\hat{\gamma}\quad\text{and}\quad\frac{1}{N}\bar{Z}_{d}'\cdot Q_{d}\hat{C}Q_{d}'\cdot\bar{Z}_{d},\quad\text{where}\quad\underset{(k+1)\times(2k+2)}{Q_{d}}\equiv\left[\begin{array}{cccc}-I_{k}&0_{k}&I_{k}&0_{k}\\ 0_{k}'&2d-1&0_{k}'&2d-1\end{array}\right];$$
$2d-1=1$ if $d=1$ and $-1$ if $d=0$.
For a simple illustration, recall the example of work ($D$) effect on doctor office visits per year ($Y$) in Chapter 1. Setting $W=X$ (no particular variable to exclude from $X$) and
$$X=(1,\text{}age,\text{}schooling,\text{}male,\text{}married,\text{}phy,\text{}psy{)}^{\mathrm{\prime}},$$
where $male$ and $married$ are dummies and $phy$ ($psy$) is self-assessed physical (psychological) condition in five categories (the lower, the better), the estimated effects ($t$-values in $(\cdot)$) are
$$\begin{aligned}
\text{effect on the population}&:\ \tau=E(Y^{1}-Y^{0})\simeq-3.15\ (-0.63),\\
\text{effect on the treated}&:\ \tau_{1}=E(Y^{1}-Y^{0}\mid D=1)\simeq0.42\ (0.059),\\
\text{effect on the untreated}&:\ \tau_{0}=E(Y^{1}-Y^{0}\mid D=0)\simeq-11.23\ (-3.60).
\end{aligned}$$
Interestingly,
${\tau}_{0}$ is significantly negative: making the nonworkers work will reduce doctor office visits by 11 per year. There are 31% nonworkers in the data.
A.4 Supplements for DD Chapter
This section supplements the DD chapter by presenting various nonparametric DD estimators and discussing ‘oneshot’ treatment. These were omitted in the main text to keep the DD chapter within a reasonable limit; for the same reason, we also put the clustered data issues in the TD chapter. Also, a generalization of DD, called ‘change in changes’, is reviewed near the end of this section.
(p.233)
A.4.1 Nonparametric Estimators for Repeated CrossSection DD
Recall the covariates $W$, the treatment qualification dummy $Q$ and the sampling dummy $S$ for the posttreatment period. Let
$$\begin{aligned}
\hat{\mu}_{11}(w)&\equiv\frac{\sum_{j}K\{(W_{j}-w)/h\}Q_{j}S_{j}Y_{j}}{\sum_{j}K\{(W_{j}-w)/h\}Q_{j}S_{j}}\to^{p}E(Y\mid W=w,Q=1,S=1),\\
\hat{\mu}_{10}(w)&\equiv\frac{\sum_{j}K\{(W_{j}-w)/h\}Q_{j}(1-S_{j})Y_{j}}{\sum_{j}K\{(W_{j}-w)/h\}Q_{j}(1-S_{j})}\to^{p}E(Y\mid W=w,Q=1,S=0),\\
\hat{\mu}_{01}(w)&\equiv\frac{\sum_{j}K\{(W_{j}-w)/h\}(1-Q_{j})S_{j}Y_{j}}{\sum_{j}K\{(W_{j}-w)/h\}(1-Q_{j})S_{j}}\to^{p}E(Y\mid W=w,Q=0,S=1),\\
\hat{\mu}_{00}(w)&\equiv\frac{\sum_{j}K\{(W_{j}-w)/h\}(1-Q_{j})(1-S_{j})Y_{j}}{\sum_{j}K\{(W_{j}-w)/h\}(1-Q_{j})(1-S_{j})}\to^{p}E(Y\mid W=w,Q=0,S=0).
\end{aligned}$$
Recall the $W$conditional effects on the treated, untreated, and population:
$$E(Y_{3}^{1}-Y_{3}^{0}\mid W_{3}=w,Q=1),\quad E(Y_{3}^{1}-Y_{3}^{0}\mid W_{3}=w,Q=0),\quad E(Y_{3}^{1}-Y_{3}^{0}\mid W_{3}=w),$$
where $w$ is to be integrated out using $F_{W\mid Q=1,S=1}=F_{W_{3}\mid Q=1}$, $F_{W\mid Q=0,S=1}=F_{W_{3}\mid Q=0}$, and $F_{W\mid S=1}=F_{W_{3}}$ for the respective marginal effects. In view of this, consistent estimators for the effect on the treated, untreated, and population are, respectively,
$$\begin{aligned}
&\frac{1}{\#\{Q=1,S=1\}}\sum_{i\in\{Q=1,S=1\}}[\hat{\mu}_{11}(W_{i})-\hat{\mu}_{10}(W_{i})-\{\hat{\mu}_{01}(W_{i})-\hat{\mu}_{00}(W_{i})\}],\\
&\frac{1}{\#\{Q=0,S=1\}}\sum_{i\in\{Q=0,S=1\}}[\hat{\mu}_{11}(W_{i})-\hat{\mu}_{10}(W_{i})-\{\hat{\mu}_{01}(W_{i})-\hat{\mu}_{00}(W_{i})\}],\\
&\frac{1}{\#\{S=1\}}\sum_{i\in\{S=1\}}[\hat{\mu}_{11}(W_{i})-\hat{\mu}_{10}(W_{i})-\{\hat{\mu}_{01}(W_{i})-\hat{\mu}_{00}(W_{i})\}],
\end{aligned}$$
where
$\mathrm{\#}\{\cdot \}$ denotes the number of members in
$\{\cdot \}$.
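The four cell-specific kernel regressions and the treated-effect estimator can be sketched as follows. The DGP (additive in a scalar $W$, with a true effect of 1.0), the Gaussian kernel, and the bandwidth are all illustrative assumptions of ours:

```python
import numpy as np

# Sketch of the four cell regressions and the effect-on-the-treated estimator.
rng = np.random.default_rng(3)
N = 20_000
W = rng.uniform(-1, 1, size=N)
Q = rng.integers(0, 2, size=N)          # treatment-qualification dummy
S = rng.integers(0, 2, size=N)          # post-treatment sampling dummy
Y = 0.5*Q + 0.3*S + W + 1.0*Q*S + rng.normal(scale=0.5, size=N)

h = 0.2
K = lambda u: np.exp(-0.5 * u**2)       # Gaussian kernel

def mu(q, s, w):
    """Kernel estimate of E(Y | W=w, Q=q, S=s)."""
    wt = K((W - w) / h) * (Q == q) * (S == s)
    return (wt * Y).sum() / wt.sum()

treated = np.where((Q == 1) & (S == 1))[0][:500]   # subsample to keep it fast
est = np.mean([mu(1, 1, W[i]) - mu(1, 0, W[i])
               - (mu(0, 1, W[i]) - mu(0, 0, W[i])) for i in treated])
```

Because the same evaluation point $w$ enters all four cells, the smooth $W$-dependence cancels in the double difference, and `est` is close to the true effect 1.0.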
A.4.2 Nonparametric Estimation for DD with TwoWave Panel Data
With only two periods ($2$ and $3$ for before and after) in hand, nonparametric estimation can be done with $\Delta Y_{3}=Y_{3}-Y_{2}$ as a single response variable and $W_{2}^{3}\equiv(C',X_{2}',X_{3}')'$ as the covariates; recall $W_{it}\equiv(C_{i}',X_{it}')'$. The resulting estimators are analogous to the matching and non-matching estimators in Chapters 2 and 3 for cross-section data. In this sense, this section may be taken as a “review” of nonparametric estimators for treatment effects. We present four estimators that appeared in Chapters 2 and 3: matching, weighting, regression imputation (RI), and complete pairing (CP). With only two waves, the time-constancy of $Q$ does not matter, as $Q_{i3}$ can be taken as $Q_{i}$. In this subsection, we write
$$(\mathrm{\Delta}Y,\mathrm{\Delta}{Y}^{0},W,Q)\text{instead of}(\mathrm{\Delta}{Y}_{3},\mathrm{\Delta}{Y}_{3}^{0},{W}_{2}^{3},{Q}_{3}).$$
(p.234)
Matching Estimators
A matching estimator for the effect on the treated under ID$_{DD}$: $\Delta Y^{0}\perp Q\mid W$ is
$$\hat{\mu}_{1}\equiv\frac{1}{N_{1}}\sum_{t\in T}\Big\{\Delta Y_{t}-\frac{1}{|C_{t}|}\sum_{c\in C_{t}}\Delta Y_{c}\Big\},$$
where $N_{1}=\sum_{i}Q_{i}$, $C_{t}$ is the comparison group for treated $t$ based on a $W$-distance, and $|C_{t}|$ is the number of controls in $C_{t}$. Henceforth, for simplicity, we pretend that all $\Delta Y_{t}$'s are used, although some treated individuals may be passed over in practice if no good matched controls are found (i.e., if $|C_{t}|=0$). The dimension problem in matching can be avoided with the propensity score $\pi(W)\equiv P(Q=1\mid W)$ in place of $W$. As for asymptotic inference, nonparametric bootstrap may be used.
A matching estimator for the effect on the untreated under ID$_{DD}'$: $(Y_{3}^{1}-Y_{2}^{0})\perp Q\mid W$ is
$$\hat{\mu}_{0}\equiv\frac{1}{N_{0}}\sum_{c\in C}\Big\{\frac{1}{|T_{c}|}\sum_{t\in T_{c}}\Delta Y_{t}-\Delta Y_{c}\Big\},$$
where $T_{c}$ is the comparison group for control $c$ and $|T_{c}|$ is its number of members. We may define $\Delta Y^{1}\equiv Y_{3}^{1}-Y_{2}^{0}$ to rewrite ID$_{DD}'$ as $\Delta Y^{1}\perp Q\mid W$, which would then go parallel with ID$_{DD}$: $\Delta Y^{0}\perp Q\mid W$.
Combining the two estimators, an estimator for the effect on the population under $\Delta Y^{0}\perp Q\mid W$ and $(Y_{3}^{1}-Y_{2}^{0})\perp Q\mid W$ is
$$\begin{aligned}
\hat{\mu}&\equiv\frac{N_{0}}{N}\hat{\mu}_{0}+\frac{N_{1}}{N}\hat{\mu}_{1}=\frac{1}{N}\sum_{i=1}^{N}\Big\{\Big(\Delta Y_{i}-\frac{1}{|C_{i}|}\sum_{c\in C_{i}}\Delta Y_{c}\Big)1[i\in T]\\
&\qquad+\Big(\frac{1}{|T_{i}|}\sum_{t\in T_{i}}\Delta Y_{t}-\Delta Y_{i}\Big)1[i\in C]\Big\}.
\end{aligned}$$
Propensity score matching can also be applied to the effects on the untreated and on the population.
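A one-nearest-neighbor sketch of $\hat{\mu}_{1}$, $\hat{\mu}_{0}$, and $\hat{\mu}$ (single matches, i.e., $|C_{t}|=|T_{c}|=1$; the DGP, with selection on $W$ and a homogeneous effect of 2, is an assumption of ours):

```python
import numpy as np

# One-nearest-neighbor sketch of the matching estimators (assumed DGP).
rng = np.random.default_rng(4)
N = 4000
W = rng.normal(size=N)
Q = (rng.random(N) < 1 / (1 + np.exp(-W))).astype(int)    # selection on W
dY = 0.5 + 0.4*W + 2.0*Q + rng.normal(scale=0.5, size=N)  # Delta Y, effect 2

T = np.where(Q == 1)[0]
C = np.where(Q == 0)[0]

def nn(i, pool):                     # nearest match for unit i by W-distance
    return pool[np.argmin(np.abs(W[pool] - W[i]))]

mu1 = np.mean([dY[t] - dY[nn(t, C)] for t in T])   # effect on the treated
mu0 = np.mean([dY[nn(c, T)] - dY[c] for c in C])   # effect on the untreated
mu = (len(C)/N) * mu0 + (len(T)/N) * mu1           # effect on the population
```

With a dense scalar $W$, the single-match discrepancy is negligible and all three estimates sit near the true effect 2.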
Weighting Estimators
Recall, for crosssection data,
$$\frac{1}{P(D=1)}E\left\{\frac{D-\pi(X)}{1-\pi(X)}Y\right\}=E(Y^{1}-Y^{0}\mid D=1),$$
which is the probability limit of the weighting estimator
${\hat{\tau}}_{1w}$ in Chapter
3. To do analogously for DD, under ID
${}_{DD}$, we just have to use
$$\frac{1}{P(Q=1)}E\left\{\frac{Q-\pi(W)}{1-\pi(W)}\Delta Y\right\}=E\{(Y_{3}^{1}-Y_{2}^{0})-\Delta Y^{0}\mid Q=1\}=E(Y_{3}^{1}-Y_{3}^{0}\mid Q=1).$$
Abadie (2005) used this to propose a number of nonparametric weighting estimators where
$\pi (\cdot )$ is nonparametrically estimated. An estimator based on the first term of this
(p.235)
display for the effect on the treated is
$$\hat{\mu}_{1w}\equiv\frac{N}{N_{1}}\frac{1}{N}\sum_{i}\frac{Q_{i}-\hat{\pi}(W_{i})}{1-\hat{\pi}(W_{i})}\Delta Y_{i}=\frac{1}{N_{1}}\sum_{i}\frac{Q_{i}-\hat{\pi}(W_{i})}{1-\hat{\pi}(W_{i})}\Delta Y_{i}.$$
Analogously, the effect on the untreated is identified by
$$\frac{1}{P(Q=0)}E\left\{\frac{Q-\pi(W)}{\pi(W)}\Delta Y\right\}=E(Y_{3}^{1}-Y_{3}^{0}\mid Q=0),$$
and its sample version is
$$\hat{\mu}_{0w}\equiv\frac{N}{N_{0}}\frac{1}{N}\sum_{i}\frac{Q_{i}-\hat{\pi}(W_{i})}{\hat{\pi}(W_{i})}\Delta Y_{i}=\frac{1}{N_{0}}\sum_{i}\frac{Q_{i}-\hat{\pi}(W_{i})}{\hat{\pi}(W_{i})}\Delta Y_{i}.$$
The effect on the population is the weighted average of $E(Y_{3}^{1}-Y_{3}^{0}\mid Q=1)$ and $E(Y_{3}^{1}-Y_{3}^{0}\mid Q=0)$, an estimator for which is
$${\hat{\mu}}_{w}\equiv \frac{{N}_{0}}{N}{\hat{\mu}}_{0w}+\frac{{N}_{1}}{N}{\hat{\mu}}_{1w}.$$
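A sketch of the three weighting estimators with a logit propensity score fitted by Newton iterations (the DGP, with a homogeneous effect of 2, and the logit choice are illustrative assumptions of ours):

```python
import numpy as np

# Sketch of the weighting estimators with an estimated logit propensity score.
rng = np.random.default_rng(5)
N = 50_000
W = rng.normal(size=N)
X = np.column_stack([np.ones(N), W])
Q = (rng.random(N) < 1 / (1 + np.exp(-(0.3 + 0.8*W)))).astype(float)
dY = 1.0*W + 2.0*Q + rng.normal(scale=0.5, size=N)   # homogeneous effect 2

b = np.zeros(2)                                      # logit MLE by Newton
for _ in range(20):
    p = 1 / (1 + np.exp(-(X @ b)))
    b += np.linalg.solve((X * (p*(1-p))[:, None]).T @ X, X.T @ (Q - p))
pi_hat = 1 / (1 + np.exp(-(X @ b)))

N1, N0 = Q.sum(), N - Q.sum()
mu1w = np.sum((Q - pi_hat) / (1 - pi_hat) * dY) / N1   # on the treated
mu0w = np.sum((Q - pi_hat) / pi_hat * dY) / N0         # on the untreated
muw = (N0/N)*mu0w + (N1/N)*mu1w                        # on the population
```

Since the effect is homogeneous here, all three estimates are close to 2; with heterogeneous effects they would differ, each converging to its own conditional mean.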
Regression Imputation Estimators
An RI estimator for the effect on the treated under $\Delta Y^{0}\perp Q\mid W$ is
$$\begin{aligned}
&\frac{1}{N_{1}}\sum_{t\in T}\hat{E}(\Delta Y\mid W_{t},Q=1)-\frac{1}{N_{1}}\sum_{t\in T}\hat{E}(\Delta Y\mid W_{t},Q=0)\\
&\quad\to^{p}E\{E(\Delta Y\mid W,Q=1)\mid Q=1\}-E\{E(\Delta Y\mid W,Q=0)\mid Q=1\}\\
&\quad=E\{E(Y_{3}^{1}-Y_{2}^{0}\mid W,Q=1)\mid Q=1\}-E\{E(\Delta Y^{0}\mid W,Q=0)\mid Q=1\}\\
&\quad=E\{E(Y_{3}^{1}-Y_{2}^{0}\mid W,Q=1)\mid Q=1\}-E\{E(\Delta Y^{0}\mid W,Q=1)\mid Q=1\}\\
&\quad=E(Y_{3}^{1}-Y_{3}^{0}\mid Q=1).
\end{aligned}$$
To be specific about $\hat{E}(\Delta Y\mid W_{t},Q=d)$, we can use, for a kernel $K$ and a bandwidth $h$,
$$\frac{\sum_{j\in\{Q=d\}}K\{(W_{j}-W_{t})/h\}\Delta Y_{j}}{\sum_{j\in\{Q=d\}}K\{(W_{j}-W_{t})/h\}}.$$
Analogous to the RI estimator for the effect on the treated are the following estimators for the effect on the untreated under $(Y_{3}^{1}-Y_{2}^{0})\perp Q\mid W$ and on the population under $\Delta Y^{0}\perp Q\mid W$ and $(Y_{3}^{1}-Y_{2}^{0})\perp Q\mid W$:
$$\begin{aligned}
&\frac{1}{N_{0}}\sum_{c\in C}\hat{E}(\Delta Y\mid W_{c},Q=1)-\frac{1}{N_{0}}\sum_{c\in C}\hat{E}(\Delta Y\mid W_{c},Q=0)\\
&\quad\to^{p}E\{E(\Delta Y\mid W,Q=1)\mid Q=0\}-E\{E(\Delta Y\mid W,Q=0)\mid Q=0\}\\
&\quad=E\{E(Y_{3}^{1}-Y_{2}^{0}\mid W,Q=1)\mid Q=0\}-E\{E(\Delta Y^{0}\mid W,Q=0)\mid Q=0\}\\
&\quad=E\{E(Y_{3}^{1}-Y_{2}^{0}\mid W,Q=0)\mid Q=0\}-E\{E(\Delta Y^{0}\mid W,Q=0)\mid Q=0\}\\
&\quad=E(Y_{3}^{1}-Y_{3}^{0}\mid Q=0);
\end{aligned}$$
(p.236)
$$\begin{aligned}
&\frac{1}{N}\sum_{i=1}^{N}\hat{E}(\Delta Y\mid W_{i},Q=1)-\frac{1}{N}\sum_{i=1}^{N}\hat{E}(\Delta Y\mid W_{i},Q=0)\\
&\quad\to^{p}E\{E(\Delta Y\mid W,Q=1)\}-E\{E(\Delta Y\mid W,Q=0)\}\\
&\quad=E\{E(Y_{3}^{1}-Y_{2}^{0}\mid W)\}-E\{E(\Delta Y^{0}\mid W)\}=E(Y_{3}^{1}-Y_{3}^{0}).
\end{aligned}$$
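A sketch of the RI estimator for the effect on the treated, with $\hat{E}(\Delta Y\mid w,Q=d)$ computed by the kernel formula above (the DGP, with a true effect of 2, and the bandwidth are assumptions of ours):

```python
import numpy as np

# Sketch of the regression-imputation estimator for the effect on the treated.
rng = np.random.default_rng(6)
N = 6000
W = rng.normal(size=N)
Q = (rng.random(N) < 1 / (1 + np.exp(-W))).astype(int)
dY = 0.4*W + 2.0*Q + rng.normal(scale=0.5, size=N)

h = 0.3
def Ehat(w, d):        # kernel regression estimate of E(dY | W=w, Q=d)
    k = np.exp(-0.5 * ((W[Q == d] - w) / h)**2)
    return (k * dY[Q == d]).sum() / k.sum()

ri1 = np.mean([Ehat(w, 1) - Ehat(w, 0) for w in W[Q == 1]])
```

Averaging the imputed difference over the treated units' $W$ values implements the outer expectation $E\{\cdot\mid Q=1\}$.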
Complete Pairing Estimator
A CP estimator for DD with continuously distributed $W$ is
$$\begin{aligned}
&\widehat{DD}_{23}\equiv\frac{1}{N_{0}N_{1}}\sum_{c=1}^{N_{0}}\sum_{t=1}^{N_{1}}K\Big(\frac{W_{t}-W_{c}}{h}\Big)(\Delta Y_{t}-\Delta Y_{c})\Big/\frac{1}{N_{0}N_{1}}\sum_{c=1}^{N_{0}}\sum_{t=1}^{N_{1}}K\Big(\frac{W_{t}-W_{c}}{h}\Big)\\
&\qquad\stackrel{p}{\longrightarrow}\mu_{23}\equiv\int\{E(\Delta Y\mid W=w,Q=1)-E(\Delta Y\mid W=w,Q=0)\}\omega_{23}(w)\partial w,
\end{aligned}$$
where $\omega_{23}(w)$ is the product of the density of $W\mid Q=1$ and the density of $W\mid Q=0$ evaluated at $w$.
Suppose $\Delta Y^{0}\perp Q\mid W$ and $(Y_{3}^{1}-Y_{2}^{0})\perp Q\mid W$ (or $(\Delta Y_{3}^{0},Y_{3}^{1}-Y_{2}^{0})\amalg Q\mid W_{2}^{3}$) hold so that we can drop $Q$ in the integrand of $\mu_{23}$, which then becomes
$$E(Y_{3}^{1}-Y_{2}^{0}\mid W=w)-E(\Delta Y_{3}^{0}\mid W=w)=E(Y_{3}^{1}-Y_{3}^{0}\mid W=w).$$
Hence we will be estimating the effect on the population. CP can also be done with $\pi(W)$ instead of $W$ to avoid the dimension problem; that is, replace $W$ in $\widehat{DD}_{23}$ with $\hat{\pi}(W)=\hat{P}(Q=1\mid W)$, the estimated probit/logit probability. Nonparametric bootstrap, or the CP asymptotic variance ignoring the $\pi(W)$-estimation error, can be used for asymptotic inference.
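The double sum in $\widehat{DD}_{23}$ can be computed by vectorizing the $N_{1}\times N_{0}$ kernel matrix; the sketch below uses an assumed DGP with a homogeneous effect of 2 and a modest bandwidth so the smoothing bias is small:

```python
import numpy as np

# Sketch of the complete-pairing estimator DD_hat_23 (assumed DGP).
rng = np.random.default_rng(7)
N = 3000
W = rng.normal(size=N)
Q = (rng.random(N) < 1 / (1 + np.exp(-W))).astype(int)
dY = 0.4*W + 2.0*Q + rng.normal(scale=0.5, size=N)

Wt, Yt = W[Q == 1], dY[Q == 1]
Wc, Yc = W[Q == 0], dY[Q == 0]
h = 0.3
Kmat = np.exp(-0.5 * ((Wt[:, None] - Wc[None, :]) / h)**2)  # N1 x N0 kernel
dd23 = (Kmat * (Yt[:, None] - Yc[None, :])).sum() / Kmat.sum()
```

The denominator normalizes the pair weights, which is what makes the probability limit a weighted average with weight $\omega_{23}(w)$.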
A.4.3 Panel Linear Model Estimation for DD with OneShot Treatment
Recall the covariates and treatmentinteracting covariates for panel data DD:
$${W}_{it}=({C}_{i}^{\mathrm{\prime}},{X}_{it}^{\mathrm{\prime}}{)}^{\mathrm{\prime}}\text{and}{G}_{it}=({A}_{i}^{\mathrm{\prime}},{H}_{it}^{\mathrm{\prime}}{)}^{\mathrm{\prime}}$$
where
${A}_{i}$ consists of elements of
${C}_{i}$ and
${H}_{it}$ consists of elements of
${X}_{it}$. With oneshot treatment at
$t=\tau $, we have
${D}_{it}={Q}_{i}1[t=\tau ]$ and the model becomes, for
$t=0,\dots ,T$,
$$\begin{aligned}
Y_{it}^{0}&=\beta_{t}+\beta_{q}Q_{i}+\beta_{qc}'Q_{i}C_{i}+\beta_{w}'W_{it}+V_{it},\quad V_{it}=\delta_{i}+U_{it}\\
Y_{it}^{1}&=Y_{it}^{0}+\beta_{d}1[t=\tau]+\beta_{dg}'1[t=\tau]G_{it}\\
\Longrightarrow\ Y_{it}&=\beta_{t}+\beta_{d}Q_{i}1[t=\tau]+\beta_{dg}'Q_{i}1[t=\tau]G_{it}+\beta_{q}Q_{i}+\beta_{qc}'Q_{i}C_{i}+\beta_{w}'W_{it}+V_{it}.
\end{aligned}$$
The treatment effect is $\beta_{d}+\beta_{dg}'G_{i\tau}$ only at period $t=\tau$, and zero at the other periods.
(p.237)
Observe
$$\begin{aligned}
&\Delta\beta_{t}=\Delta\beta_{1}+(\Delta\beta_{2}-\Delta\beta_{1})1[t=2]+\cdots+(\Delta\beta_{T}-\Delta\beta_{1})1[t=T];\\
&\Delta Q_{i}1[t=\tau]=Q_{i}1[t=\tau]-Q_{i}1[t-1=\tau]=Q_{i}(1[t=\tau]-1[t=\tau+1]);\\
&\Delta(\beta_{dg}'Q_{i}1[t=\tau]G_{it})=\Delta(\beta_{dg}'Q_{i}1[t=\tau]G_{i\tau})=\beta_{dg}'Q_{i}G_{i\tau}(1[t=\tau]-1[t=\tau+1]).
\end{aligned}$$
Using these, difference the
${Y}_{it}$ equation to obtain
$$\begin{aligned}
\Delta Y_{it}&=\Delta\beta_{1}+(\Delta\beta_{2}-\Delta\beta_{1})1[t=2]+\cdots+(\Delta\beta_{T}-\Delta\beta_{1})1[t=T]\\
&\quad+\beta_{d}Q_{i}(1[t=\tau]-1[t=\tau+1])+\beta_{dg}'Q_{i}G_{i\tau}(1[t=\tau]-1[t=\tau+1])\\
&\quad+\beta_{x}'\Delta X_{it}+\Delta U_{it}.
\end{aligned}$$
To implement DD with this, LSE can be applied under
$$E(\Delta U_{t}\mid G_{\tau},\Delta X_{t},Q)=0\quad\forall t=1,\dots,T.$$
With no interacting covariates (
${\beta}_{dg}=0$), the LSE is for
$\mathrm{\Delta}{Y}_{it}$ only on
$$1,\ 1[t=2],\dots,1[t=T],\ Q_{i}(1[t=\tau]-1[t=\tau+1]),\ \Delta X_{it};$$
the slope of $Q_{i}(1[t=\tau]-1[t=\tau+1])$ is the desired effect $\beta_{d}$.
To implement the LSE with $t=0,1,2,3,4,5$ and $\tau =3$, observe
$$\left[\begin{array}{c}\Delta Y_{i1}\\ \Delta Y_{i2}\\ \Delta Y_{i3}\\ \Delta Y_{i4}\\ \Delta Y_{i5}\end{array}\right]=\left[\begin{array}{cccccccc}1&0&0&0&0&0&0&\Delta X_{i1}'\\ 1&1&0&0&0&0&0&\Delta X_{i2}'\\ 1&0&1&0&0&Q_{i}&G_{i3}'Q_{i}&\Delta X_{i3}'\\ 1&0&0&1&0&-Q_{i}&-G_{i3}'Q_{i}&\Delta X_{i4}'\\ 1&0&0&0&1&0&0&\Delta X_{i5}'\end{array}\right]\left[\begin{array}{c}\Delta\beta_{1}\\ \Delta\beta_{2}-\Delta\beta_{1}\\ \Delta\beta_{3}-\Delta\beta_{1}\\ \Delta\beta_{4}-\Delta\beta_{1}\\ \Delta\beta_{5}-\Delta\beta_{1}\\ \beta_{d}\\ \beta_{dg}\\ \beta_{x}\end{array}\right]+\left[\begin{array}{c}\Delta U_{i1}\\ \Delta U_{i2}\\ \Delta U_{i3}\\ \Delta U_{i4}\\ \Delta U_{i5}\end{array}\right];$$
the
$\mathrm{\Delta}{Y}_{i3}$,
$\mathrm{\Delta}{Y}_{i4}$, and
$\mathrm{\Delta}{Y}_{i5}$ equations are obtained by setting
$t=3,4,5$ in the
$\mathrm{\Delta}{Y}_{it}$ equation:
$$\begin{aligned}
\Delta Y_{i3}&=\Delta\beta_{1}+(\Delta\beta_{3}-\Delta\beta_{1})+\beta_{d}Q_{i}+\beta_{dg}'Q_{i}G_{i3}+\beta_{x}'\Delta X_{i3}+\Delta U_{i3},\\
\Delta Y_{i4}&=\Delta\beta_{1}+(\Delta\beta_{4}-\Delta\beta_{1})-\beta_{d}Q_{i}-\beta_{dg}'Q_{i}G_{i3}+\beta_{x}'\Delta X_{i4}+\Delta U_{i4},\\
\Delta Y_{i5}&=\Delta\beta_{1}+(\Delta\beta_{5}-\Delta\beta_{1})+\beta_{x}'\Delta X_{i5}+\Delta U_{i5}.
\end{aligned}$$
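The stacked differenced LSE can be sketched as follows, with $t=0,\dots,5$, $\tau=3$, no interacting covariates ($\beta_{dg}=0$), and a scalar $X$; the DGP, including i.i.d. $\Delta U_{it}$ for simplicity (in reality $\Delta U_{it}$ is serially correlated, which affects only the standard errors), is an assumption of ours:

```python
import numpy as np

# Sketch of the differenced LSE with t=0..5, tau=3, beta_dg=0, scalar X.
rng = np.random.default_rng(8)
n, T, tau = 2000, 5, 3
beta_d, beta_x = 1.5, 0.7
Q = rng.integers(0, 2, size=n)

blocks, ys = [], []
for t in range(1, T + 1):                  # differenced periods t=1,...,T
    dX = rng.normal(size=n)
    period = np.zeros((n, T - 1))          # dummies 1[t=2],...,1[t=T]
    if t >= 2:
        period[:, t - 2] = 1.0
    dtreat = Q * ((t == tau) - (t == tau + 1))   # Q_i(1[t=tau]-1[t=tau+1])
    ys.append(0.1*t + beta_d*dtreat + beta_x*dX
              + rng.normal(scale=0.3, size=n))   # 0.1*t plays Delta beta_t
    blocks.append(np.column_stack([np.ones(n), period, dtreat, dX]))

Z, y = np.vstack(blocks), np.concatenate(ys)
coef = np.linalg.lstsq(Z, y, rcond=None)[0]
# coef = (Dbeta_1, Dbeta_2-Dbeta_1, ..., Dbeta_T-Dbeta_1, beta_d, beta_x)
b_d_hat, b_x_hat = coef[T], coef[T + 1]
```

The slope on the `dtreat` column is the desired effect $\beta_{d}$, picked up from the $+Q_{i}$ row at $t=\tau$ and the $-Q_{i}$ row at $t=\tau+1$.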
(p.238)
A.4.4 Change in Changes
For repeated crosssections, Athey and Imbens (2006) proposed a DD without assuming the additivity of the effects. In this section, we review their approach, which is hard to use in practice, not least because of the difficulty in allowing for covariates.
Athey and Imbens assumed that the untreated response ${Y}^{0}$ is determined by
$${Y}_{i}^{0}=h({U}_{i},{S}_{i})$$
for an unknown function
$h(u,s)$ strictly increasing in
$u$; recall that
$S=1$ is the dummy for being sampled in the treated period. In this model,
${Y}^{0}$ is the same across the treatment and control groups because the group dummy
$Q$ does not enter
$h$ directly, although
$Q$ may enter indirectly; for example,
$U=(1-Q)U^{0}+QU^{1}$ for group $0$ and $1$ error terms $U^{0}$ and $U^{1}$.
Further assume
$$\begin{aligned}
(i)&:\ U\amalg S\mid Q,\\
(ii)&:\ \text{the support of }U^{1}\text{ is a subset of the support of }U^{0}.
\end{aligned}$$
The assumption $(i)$ is a weaker version of $S$ being independent of the other random variables, and $(ii)$ is needed to construct the desired counterfactual $Y^{0}\mid(Q=1,S=1)$ using the identified distributions of the control group's before and after responses, $Y^{0}\mid(Q=0,S=0)$ and $Y^{0}\mid(Q=0,S=1)$. To simplify exposition, assume that $U$ is continuously distributed.
The main challenge in any cross-section DD is constructing the counterfactual $Y^{0}\mid(Q=1,S=1)$. If we knew $h(\cdot,\cdot)$ in $Y^{0}=h(U,S)$, then we would be able to find $Y^{0}\mid(Q=1,S=1)$ by using the inverse function $h^{-1}(\cdot;S)$ of $h(U,S)$ for a given $S$. Specifically, first, find $U$ using
$$U=h^{-1}(Y^{0};0)\quad\text{for }S=0.$$
Second, with this $U$, the desired counterfactual $Y^{0}\mid(Q=1,S=1)$ would be
$$h(U,1)=h\{h^{-1}(Y^{0};0),1\}.$$
Although $h$ is not observed and thus this scenario does not work, it is enough to know the distribution of $Y^{0}\mid(Q=1,S=1)$ to obtain its mean as follows, rather than the individual $Y_{i}^{0}\mid(Q_{i}=1,S_{i}=1)$ for each $i$.
With $F_{Y^{0},jk}$ denoting the distribution function of $Y^{0}\mid(Q=j,S=k)$, the key equation in Athey and Imbens (2006) that gives the desired counterfactual distribution is
$$F_{Y^{0},11}(y)=F_{Y,10}[F_{Y,00}^{-1}\{F_{Y,01}(y)\}];$$
$Y$ on the right-hand side equals $Y^{0}$ because the $(1,0)$, $(0,0)$, and $(0,1)$ groups are all untreated. Using this, the desired counterfactual mean $E(Y^{0}\mid Q=1,S=1)$ is
(p.239)
$$\begin{aligned}
\int y\,\partial F_{Y^{0},11}(y)&=\int y\cdot\partial F_{Y,10}[F_{Y,00}^{-1}\{F_{Y,01}(y)\}]=\int F_{Y,01}^{-1}\{F_{Y,00}(y')\}\cdot\partial F_{Y,10}(y'),\\
\text{setting }y&=F_{Y,01}^{-1}\{F_{Y,00}(y')\}\iff F_{Y,00}^{-1}\{F_{Y,01}(y)\}=y'.
\end{aligned}$$
Using the last expression $\int F_{Y,01}^{-1}\{F_{Y,00}(y')\}\cdot\partial F_{Y,10}(y')$ for $E(Y^{0}\mid Q=1,S=1)$, Athey and Imbens proposed “change in changes (CC)”:
$$\begin{aligned}
&E(Y\mid Q=1,S=1)-E[F_{Y,01}^{-1}\{F_{Y,00}(Y)\}\mid Q=1,S=0]\\
&\quad=E(Y^{1}\mid Q=1,S=1)-E[F_{Y,01}^{-1}\{F_{Y,00}(Y^{0})\}\mid Q=1,S=0]\\
&\quad=E(Y^{1}\mid Q=1,S=1)-E(Y^{0}\mid Q=1,S=1)\\
&\quad=E(Y^{1}-Y^{0}\mid Q=1,S=1),
\end{aligned}$$
which is the effect on the treated in the post-treatment era.
A sample version of CC is
$$\frac{1}{\#\{Q=1,S=1\}}\sum_{i\in\{Q=1,S=1\}}Y_{i}-\frac{1}{\#\{Q=1,S=0\}}\sum_{i\in\{Q=1,S=0\}}\hat{F}_{Y,01}^{-1}\{\hat{F}_{Y,00}(Y_{i})\},$$
where $\hat{F}_{Y,jk}$ is the empirical distribution function of $Y\mid(Q=j,S=k)$. Covariates can be allowed by conditioning the above means and distributions on $X$ (i.e., using only the observations sharing the same value of $X$ in the sample version), but this is hardly practical.
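The sample version can be sketched with sorted cell subsamples serving as the empirical df and its quantile function. The DGP below, with $h(u,s)=u+s$ (a time effect of 1) and a treatment effect of 2 for the $(Q=1,S=1)$ cell, is an assumption of ours:

```python
import numpy as np

# Sketch of the CC sample version (assumed DGP: Y^0 = h(U,S) = U + S).
rng = np.random.default_rng(9)
n = 20_000
Q = rng.integers(0, 2, size=n)
S = rng.integers(0, 2, size=n)
U = rng.normal(size=n) + 0.5*Q        # group heterogeneity through U only
Y = U + S + 2.0*Q*S                   # treated outcome adds 2 when Q=S=1

y00 = np.sort(Y[(Q == 0) & (S == 0)])
y01 = np.sort(Y[(Q == 0) & (S == 1)])
y10 = Y[(Q == 1) & (S == 0)]
y11 = Y[(Q == 1) & (S == 1)]

# F_hat_{Y,00}(y) via ranks, then F_hat^{-1}_{Y,01} via order statistics
p = np.searchsorted(y00, y10, side='right') / len(y00)
idx = np.clip(np.ceil(p * len(y01)).astype(int) - 1, 0, len(y01) - 1)
cc = y11.mean() - y01[idx].mean()     # effect on the treated, post-period
```

Here the transform $y\mapsto F_{Y,01}^{-1}\{F_{Y,00}(y)\}$ amounts to $y+1$, so the second term estimates the counterfactual mean and `cc` is close to the true effect 2.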
A simple intuitive way to understand CC is using “quantile time effect.” Recall that the main task in DD is constructing the counterfactual untreated response with only the time effect in the treatment group. More specifically, the task is finding the time effect using only the control group first, and then adding it to the treatment group’s pretreatment response, under the identification condition that the time effect is the same across the two groups.
To see how the time effect can be extracted using the control group, use the “$p$quantile time effect” for the control group:
$$\begin{array}{rcl}{F}_{Y,01}^{1}(p){F}_{Y,00}^{1}(p)& =& {F}_{Y,01}^{1}\{{F}_{Y,00}(y)\}{F}_{Y,00}^{1}\{{F}_{Y,00}(y)\}\\ & & \text{(with the}p\text{quantile}y\text{of}{F}_{Y,00}\text{:}p={F}_{Y,00}(y)\text{)}\\ & =& {F}_{Y,01}^{1}\{{F}_{Y,00}(y)\}y\text{.}\end{array}$$
Without the time effect, this equation becomes
$yy=0$ because
${F}_{Y,01}={F}_{Y,00}$. That is, the function
$$y\u27fc{F}_{Y,01}^{1}\{{F}_{Y,00}(y)\}$$
gives the timeeffectaugmented untreated response. Hence, putting
$Y(Q=1,S=0)$ into this function gives the timeeffectaugmented version
${Y}^{0}(Q=1,S=1)$ as was seen in CC:
$$E[\text{}{F}_{Y,01}^{1}\{{F}_{Y,00}(Y)\}\text{}Q=1,S=0\text{}].$$
(p.240)