Empirical Economics with R (Part B): Confounders, Proxies and Sources of Exogenous Variations

14 Jan 2021

Chapter 3 of my course Empirical Economics with R mainly dives into the difficulties of estimating causal effects. (See the previous post for an overview of the earlier chapters dealing with prediction.)

One important point I emphasize in this chapter, with several small applications and simulations, is that for the estimation of causal effects it typically does not suffice to think only about possible confounders. We should also be confident that there is a compelling source of exogenous variation for our explanatory variable of interest. To be more concrete, consider the regression model

\[y = \beta_0 + \beta_1 x_1 + u\]

where $\beta_1$ measures the linear causal effect of $x_1$ on $y$ that we would like to estimate. The following graph illustrates the true data generating process:

The variable $x_2$ is a confounder that influences both $x_1$ and the dependent variable $y$. If we omit the confounder $x_2$ from our regression, our OLS estimator $\hat \beta_1$ of the causal effect $\beta_1$ will be biased and might be far off the true causal effect. Typically, we won’t be able to control exactly for every confounder, but we may observe proxy variables that are correlated with the confounders. Adding a good proxy (one that is highly correlated with the confounder) as a control variable may substantially reduce the bias of our estimator.
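The size of this bias can be made explicit with the standard omitted variable bias formula. As a sketch, assume (this notation goes beyond the regression equation above) that the true model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, where $\beta_2$ is the effect of the confounder on $y$ and $\varepsilon$ is uncorrelated with $x_1$ and $x_2$. If we regress $y$ only on $x_1$, the OLS slope then converges in large samples to

\[\text{plim} \; \hat \beta_1 = \beta_1 + \beta_2 \frac{Cov(x_1, x_2)}{Var(x_1)}\]

The bias is therefore large if the confounder strongly affects $y$ and if a large share of the variation in $x_1$ is driven by $x_2$.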

Let us look at a simulated data set:

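# Simulate data in which x2 confounds the effect of x1 on y
sim.dat = function(n=100000,beta0=0, beta1=1,beta2=1,
                   sd.proxy.noise=1, sd.exo=1) {
  # Confounder
  x2 = rnorm(n,0,1)
  # Proxy: the confounder plus measurement noise
  proxy = x2+rnorm(n,0,sd.proxy.noise)
  # x1 is driven by the confounder and by exogenous variation
  x1 = x2+rnorm(n,0,sd.exo)
  
  eps = rnorm(n,0,1)
  y = beta0+beta1*x1 + beta2*x2 + eps
  data.frame(y,x1,x2,proxy)
}
set.seed(123)
dat = sim.dat(beta1=1,sd.proxy.noise = 0.05, sd.exo = 1)
library(stargazer)
# Compare: no control (1), control for x2 (2), control for proxy (3)
stargazer(lm(y~x1,dat), lm(y~x1+x2,dat), lm(y~x1+proxy,dat),type="html", keep.stat = c("rsq"))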


Dependent variable: y

| | (1) | (2) | (3) |
|---|---|---|---|
| x1 | 1.502*** | 1.002*** | 1.005*** |
| | (0.003) | (0.003) | (0.003) |
| x2 | | 1.000*** | |
| | | (0.004) | |
| proxy | | | 0.994*** |
| | | | (0.004) |
| Constant | 0.001 | 0.001 | 0.0003 |
| | (0.004) | (0.003) | (0.003) |
| R² | 0.750 | 0.833 | 0.832 |

Note: *p<0.1; **p<0.05; ***p<0.01

Without any control variable, our estimate of the causal effect $\beta_1 = 1$ is heavily biased. In contrast, controlling for proxy is in our simulation almost as good as directly controlling for the confounder x2. The reason is that our proxy is extremely good: it has only very small random noise (determined by sd.proxy.noise) and is thus highly correlated with the confounder:

cor(dat$x2, dat$proxy)

## [1] 0.9987416
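Both the bias in column (1) and this correlation can be checked against the large-sample values implied by the data generating process in sim.dat. The following is only a small sketch: the helper names plim.short.slope and cor.proxy are made up for this illustration, and the formulas use that x2 has variance 1 and is independent of the noise terms.

```r
# Hypothetical helpers (not part of the course material) that compute
# large-sample values implied by the data generating process in sim.dat.

# Limit of the OLS slope when y is regressed on x1 only.
# Uses Cov(x1, x2) = Var(x2) = 1 and Var(x1) = 1 + sd.exo^2.
plim.short.slope = function(beta1 = 1, beta2 = 1, sd.exo = 1) {
  beta1 + beta2 * 1 / (1 + sd.exo^2)
}

# Correlation between the confounder x2 and the proxy.
# Uses Cov(x2, proxy) = 1 and Var(proxy) = 1 + sd.proxy.noise^2.
cor.proxy = function(sd.proxy.noise = 1) {
  1 / sqrt(1 + sd.proxy.noise^2)
}

plim.short.slope(beta1 = 1, beta2 = 1, sd.exo = 1)
cor.proxy(sd.proxy.noise = 0.05)
```

With sd.exo = 1 the first call returns 1.5, in line with the estimate of 1.502 in column (1), and the second call returns roughly 0.99875, in line with the sample correlation above.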

The following simulation changes only one thing: it reduces the exogenous variation in x1 by lowering the parameter sd.exo, so that the confounder x2 is now by far the largest source of variation in x1.

set.seed(123)
dat = sim.dat(beta1=1,sd.proxy.noise = 0.05, sd.exo = 0.025)
stargazer(lm(y~x1,dat), lm(y~x1+x2,dat), lm(y~x1+proxy,dat),type="html",keep.stat = c("rsq"))


Dependent variable: y

| | (1) | (2) | (3) |
|---|---|---|---|
| x1 | 2.001*** | 1.079*** | 1.971*** |
| | (0.003) | (0.127) | (0.057) |
| x2 | | 0.922*** | |
| | | (0.127) | |
| proxy | | | 0.030 |
| | | | (0.056) |
| Constant | 0.001 | 0.001 | 0.001 |
| | (0.003) | (0.003) | (0.003) |
| R² | 0.800 | 0.800 | 0.800 |

Note: *p<0.1; **p<0.05; ***p<0.01

While our proxy is still great (cor(x2,proxy)=0.9987), adding the proxy as a control variable now does almost nothing to reduce the bias of our OLS estimator $\hat \beta_1$. I think this simulation illustrates a very important insight:

For the estimation of causal effects, it typically does not suffice to control well for confounders; we also need a sufficiently strong source of exogenous variation for our explanatory variable of interest.
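To see how gradually this problem sets in, one can repeat the simulation for several values of sd.exo while keeping the proxy equally good. The following is a minimal sketch: the helper sim.proxy.bias and the grid of values are assumptions made for this illustration, and the helper simply mirrors the data generating process of sim.dat above.

```r
# Hypothetical helper: simulate data as in sim.dat and return the bias of
# the OLS estimate of beta1 when we control for the proxy instead of x2.
sim.proxy.bias = function(sd.exo, n = 100000, beta1 = 1, beta2 = 1,
                          sd.proxy.noise = 0.05) {
  x2 = rnorm(n, 0, 1)                         # confounder
  proxy = x2 + rnorm(n, 0, sd.proxy.noise)    # very good proxy
  x1 = x2 + rnorm(n, 0, sd.exo)               # sd.exo = exogenous variation
  y = beta1 * x1 + beta2 * x2 + rnorm(n, 0, 1)
  unname(coef(lm(y ~ x1 + proxy))["x1"]) - beta1
}

set.seed(123)
# Illustrative grid, from a lot of exogenous variation to very little
sd.exo.grid = c(1, 0.5, 0.1, 0.05, 0.025)
res = data.frame(
  sd.exo = sd.exo.grid,
  bias = sapply(sd.exo.grid, sim.proxy.bias)
)
print(res)
```

The quality of the proxy is the same in every row; only the amount of exogenous variation in x1 differs.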

Many modern empirical economics articles that study non-experimental data are very aware of this point and carefully describe the identifying variation in the variable of interest. Yet, somehow I never really got this point in my own days as a student. Therefore, one focus of this chapter is to discuss the possible sources of variation of the explanatory variable of interest in several small applications. One conclusion is that many typical field data sets simply don’t allow us to convincingly estimate certain causal effects because there is no clear source of exogenous variation.

Note that thinking about exogenous variation is also important if you apply modern machine learning techniques adapted to the estimation of causal effects. I discussed this point in the context of post-double selection with lasso regressions here.

There are several other insights that chapter 3 points out:

If possible, try to run a randomized experiment to generate exogenous variation in the explanatory variable of interest.
If we don’t have a consistent estimator $\hat \beta_1$ of a certain causal effect, confidence intervals can also be far away from the true causal effect.
Under relatively weak assumptions, $\hat \beta$ consistently estimates the so-called coefficients of the best linear predictor. Just by looking at a regression equation, it is not clear what a stated $\beta_1$ is meant to measure: a particular causal effect or a coefficient of the best linear predictor.
The chapter also illustrates how to estimate non-linear and heterogeneous treatment effects. My main interpretation of simple linear specifications without interaction terms is that their coefficients measure some average over likely more complex non-linear and heterogeneous effects. Yet, the potential outcomes framework, which takes heterogeneity much more seriously, will be discussed only later in chapter 5.
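The points about confidence intervals and the best linear predictor can be illustrated with a short simulation. This is only a sketch under the same assumed data generating process as in the simulations above (true causal effect of x1 equal to 1, confounder x2 omitted from the regression); the seed and the variable names are arbitrary.

```r
set.seed(1)
n = 100000
x2 = rnorm(n, 0, 1)                  # confounder, omitted below
x1 = x2 + rnorm(n, 0, 1)             # explanatory variable of interest
y = 1 * x1 + 1 * x2 + rnorm(n, 0, 1) # true causal effect of x1 is 1

# 95% confidence interval for the slope of the short regression
ci = confint(lm(y ~ x1), "x1", level = 0.95)
ci
```

The interval is very narrow, but it is centered close to 1.5, the coefficient of the best linear predictor of y given x1, and does not come anywhere near the causal effect of 1.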

You can find all material in the course’s GitHub repository. Take a look at the setup instructions if you want to solve the RTutor problem sets on your own computer.
