The goal of the bate
package is to present some
functions to conduct sensitivity analysis of omitted variable bias in
linear econometric models. The functions in this package present
functions to implement two approaches: (a) the partial R-squared
approach of Cinelli and Hazlett (2020); and (b) the δ-Rmax
approach of Oster (2019) implemented with an algorithm proposed in Basu
(2022).
The Cinelli and Hazlett (2020) approach is implemented by the
cinhaz
function. To use this function, the user chooses the
following parameters:
Here is an example:
library(sensemakr)
#> See details in:
#> Carlos Cinelli and Chad Hazlett (2020). Making Sense of Sensitivity: Extending Omitted Variable Bias. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
bate::cinhaz(
kd=1, ky=1, data=darfur,outcome = "peacefactor",
treatment = "directlyharmed", bnch_reg = "female",
other_reg = c("village","age","farmer_dar","herder_dar","pastvoted","hhsize_darfur"),
alpha = 0.05
)
#> Total R2-Based Partial R2-Based
#> Estimate 0.09731582 0.09731582
#> Std Error 0.02325654 0.02325654
#> R2YD|X 2.18730933 2.18730933
#> RV_q 13.87763539 13.87763539
#> RV_qa 7.62579655 7.62579655
#> R2YZ|DX 25.90716055 12.46409230
#> R2DZ|X 0.26800005 0.91642867
#> CI:Lower Limit 0.03449265 0.02956773
#> CI:Upper Limit 0.12579774 0.12087282
In implementing the δ-Rmax
approach of Oster (2019), the bate
package presents some
functions to compute quantiles of the empirical distribution of the
bias-adjusted treatment effect (BATE).
To analyze such models, a researcher should consider four regression models: (a) a short regression model where the outcome variable is regressed on the treatment variable, with or without additional controls; (b) an intermediate regression model where additional control variables are added to the short regression; (c) a hypothetical long regression model, where an index of the omitted variable(s) is added to the intermediate regressions; and (d) an auxiliary regression where the treatment variable is regressed on all observed (and included) control variables.
As an example, suppose a researcher has estimated the following model, y = α + β1x + γ1w1 + γ2w2 + ε, and is interested in understanding the impact of some omitted variables on the results. In this case:
In this example, the estimated treatment effect is β̂1. In the presence of omitted variables, β̂1 is a biased estimate of the true treatment effect, β1. The functions in this package will allow researchers to create quantiles of the empirical distribution of the BATE, i.e. the treatment effect once we have adjusted for the effect of omitted variable bias. We will denote the BATE as β*.
The researcher will need to supply the data set (as a data frame), the name of the outcome variable, the name of the treatment variable, and the names of the additional regressors in the intermediate regression. The functions in this package will then compute the quantiles of the empirical distribution of BATE, β*.
Two parameters capture the effect of the omitted variables in this set up.
The first parameter is δ.
This captures the relative strength of the unobservables, compared to
the observable controls, in explaining variation in the treatement
variable. In the functions below this is denoted as the parameter
delta
. This parameter is a real number and can take any
value on the real line, i.e. it is unbounded. Hence, in any specific
analysis, the researcher will have to choose a lower and an upper bound
for delta
. For instance, if in any empirical analysis, the
researcher believes, based on knowledge of the specific problem being
investigated, that the unobservables are less important than
the observed controls in explaining the variation in the treatment
variable, then she could choose delta
to lie between 0
and 1. On the other hand, if she believes that the unobservables are
more important than the observed controls in explaining the
variation in the treatment variable, then she should choose
delta
to lie between 1 and 2 or 1 and 5.
The second parameter is Rmax.
This captures the relative strength of the unobservables, compared to
the observable controls, in explaining variation in the outcome
variable. In the functions below, this is captured by the parameter
Rmax
. The parameter Rmax
is the R-squared in
the hypothetical long regression. Hence, it lies between the R-squared
in the intermediate regression (R̃) and 1. Since the lower bound of
Rmax
is given by R̃, in any specific analysis, the
researcher will only have to choose an upper bound for
Rmax
.
In a specific empirical analysis, a researcher will use domain
knowledge about the specific issue under investigation to determine a
plausible range for delta
(e.g. 0.01 ≤ δ ≤ 0.99). This will be given
by the interval on the real line lying between deltalow
and
deltahigh
(the researcher will choose deltalow
and deltahigh
). Using the example in this paragraph,
deltalow=0.01
and deltahigh=0.99
.
In a similar manner, a researcher will use domain knowledge about the
specific issue under investigation to determine Rmax
. Here,
it will be important to keep in mind that Rmax
is the
R-squared in the hypothetical long regression. Now, it is unlikely that
including all omitted variables and thereby estimating the hypothetical
long regression will give an R-squared of 1. This is because, even after
all the regressors have been included, some variation of the outcome
might be plausibly explained by a stochastic element. Hence,
Rmax
will most likely be different from, and less than, 1.
This will be denoted by Rhigh
(e.g. Rmax=0.61
).
How is the omitted variable bias and the BATE computed? The key result that is used to compute the BATE is this: the omitted variable bias is the real root of the following cubic equation aν3 + bν2 + cν + d = 0, where
where, in turn,
Hence, we see that the coefficients of the cubic equation are
functions of the variances of the outcome and treatment variables (σY2, σX2),
parameters of the short regression (β̊, R̊), intermediate
regression (β̃, R̃) and
auxiliary regression (τX2),
and the values of delta
and Rmax
. The
important result is that the omitted variable bias is the real root of
the above cubic equation (Proposition 2, Oster, 2019;
Proposition 1 and 2 in Basu, 2022).
In a specific empirical analysis, the variances of the outcome and
treatment variables, and the parameters of the short, intermediate and
auxiliary regressions are known. Hence, the coefficients of the cubic
equation become functions of delta
and Rmax
,
the two key parameters that the researcher chooses, using domain
knowledge.
Once the researcher has chosen deltalow
,
deltahigh
and Rhigh
, this defines a bounded
box on the (delta
, Rmax
) plane defined by the
Cartesian product of the interval [deltalow
,
deltahigh
] and of the interval [Rlow
,
Rhigh
]. The main functions in this package computes the
root of the cubic equation on a sufficiently granular grid (the degree
of granularity will be chosen by the user) covering the bounded box.
To compute the root of the cubic equation, the algorithm first evaluates the discriminant of the cubic equation on each point of the grid and partitions the box into two regions: (a) unique real root (URR) and NURR (no unique real root). There are three cases to consider.
delta
direction to generate a nonempty intersection
with a URR region. Once that is found, the algorithm implements the
steps outlined in step 2.The bias is then used to compute the BATE, β*, which is defined as the estimated treatment effect in the intermediate regression minus the bias, i.e. β* = β̃ − ν, where β* is the bias-adjusted treatment effect, β̃ is the treatment effect estimated from the intermediate regression and ν is the real root of the relevant cubic equation.
The functions in this package will compute β* at each point of the grid that covers the bounded box chosen by the researcher. Hence, this will generate a large vector of values of β* and we can use this to compute the empirical distribution of β*, the BATE. Asymptotic theory shows that the BATE converges in probability to the true treatment effect, i.e. $$\beta^* \overset{p}{\to} \beta.$$ Hence, the interval defined by the 2.5-th and 97.5-th quantiles of the empirical distribution of the BATE will contain the true treatment effect with 95 percent probability.
This package provides three functions to compute quantiles of the
empirical distribution of the omitted bias and BATE,
ovbias()
, ovbias_lm()
and
ovbias_par()
. These functions implement the same algorithm
to compute the empirical distribution of the bias and BATE and only
differ in how the user provides the parameters of the short,
intermediate and auxiliary regressions. But before we discuss the main
functions, let us look at two other functions that are useful.
An useful function to collect relevant parameters from the short, intermediate and auxiliary regressions is:
collect_par()
: collects parameters from the short,
intermediate and auxiliary regressions; (user provides name of the data
set, name of outcome variable, name of treatment variable, names of
control variables in the short regression, if relevant, and names of
additional variables in the intermediate regression); the output of this
function is a data frame.Users can use the output from collect_par()
to construct
an area plot of the bounded box using:
urrplot()
: creates a colored area plot of the bounded
box chosen by the user demarcating the area where the cubic equation has
unique real root (URR) from the area where the cubic equation has three
real roots (NURR); the output is a plot object.To use the ovbias()
function, the user will need to run
first collect the parameters from the short, intermediate and auxiliary
regressions, using the collect_par()
function and then feed
it into the ovbias()
function, along with the
following:
delta
(e.g. 0.01)delta
(e.g. 0.99)Rmax
(e.g. 0.61)In using the collect_par()
function, the user will need
to specify the following:
The output of the ovbias()
function is a list of three
elements.
bias
) and
bias-adjusted treatment effect (bstar
) for each point on
the grid.bias
.bstar
).To use the ovbias_par()
function, the user needs to
specify the following:
delta
(e.g. 0.01)delta
(e.g. 0.99)Rmax
(e.g. 0.61)The output of the ovbias_par()
function is identically
same with the output of the ovbias()
function.
To use the ovbias_lm()
function, the user needs to
specify three lm
objects that capture the short,
intermediate and auxiliary regressions:
delta
(e.g. 0.01)delta
(e.g. 0.99)Rmax
(e.g. 0.61)The output of the ovbias_lm()
function is identically
same with the output of the ovbias()
function.
Using the output from ovbias()
,
ovbias_par()
or ovbias_lm()
, users can
construct various plots to visualize the results:
cplotbias()
: contour plot of the bias over the bounded
box; the output of this function is a plot object;dplotbate()
: histogram and density plot of BATE; the
output of this function is a plot object;The methodology proposed in Basu (2022) is slightly different from, and a critique of, Oster (2019). Hence, it might be useful to compare the results of the two methodologies. The methodology proposed in Oster (2019) is implemented via these functions:
osterbds()
: identified sets according to Oster’s
methodology; the output of this function is a data frame;osterdelstar()
: the value of δ* for a chosen value of
Rmax;
the output of this function is a data frame;delfplot()
: a plot of the graph of the function, δ = f(Rmax);
the output of this function is a plot object.Install the package from CRAN and then load it.
Let us load the data set.
The data set has two .RData
objects:
NLSY_IQ
(to be used for the analysis of maternal behavior
on child IQ) and NLSY_BW
(to be used for the analysis of
maternal behavior on child birthweight).
In this example, we will analyse the effect of maternal behavior on
child IQ scores. Let us start out by seeing the names of the variables
in the NLSY_IQ
data set.
names(NLSY_IQ)
#> [1] "iq_std" "BF_months" "mom_drink_preg_all"
#> [4] "lbw_preterm" "age" "female"
#> [7] "black" "motherAge" "motherEDU"
#> [10] "mom_married" "income" "sex"
#> [13] "race"
For use in the example below, let us set age
and
race
as factor variables
Let us use the vtable
package to see the summary
statistics of the variables in our data set.
Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
---|---|---|---|---|---|---|---|
iq_std | 6514 | 0.017 | 0.99 | -4.1 | -0.59 | 0.69 | 3.5 |
BF_months | 6514 | 2.4 | 4.6 | 0 | 0 | 3 | 48 |
mom_drink_preg_all | 6090 | 0.32 | 0.47 | 0 | 0 | 1 | 1 |
lbw_preterm | 5837 | 0.051 | 0.22 | 0 | 0 | 0 | 1 |
age | 6514 | ||||||
… 4 | 1794 | 28% | |||||
… 5 | 1847 | 28% | |||||
… 6 | 1075 | 17% | |||||
… 7 | 922 | 14% | |||||
… 8 | 876 | 13% | |||||
female | 6514 | 0.49 | 0.5 | 0 | 0 | 1 | 1 |
black | 6514 | 0.28 | 0.45 | 0 | 0 | 1 | 1 |
motherAge | 6514 | 25 | 5.7 | 14 | 21 | 28 | 45 |
motherEDU | 6514 | 12 | 2.6 | 1 | 11 | 13 | 95 |
mom_married | 6514 | 0.65 | 0.48 | 0 | 0 | 1 | 1 |
income | 6514 | 40800 | 80453 | 0 | 11474 | 45000 | 974100 |
sex | 6514 | 1.5 | 0.5 | 1 | 1 | 2 | 2 |
race | 6514 | ||||||
… 1 | 1328 | 20% | |||||
… 2 | 1838 | 28% | |||||
… 3 | 3348 | 51% |
We will work with an example where the effect of months of breastfeeding on children’s IQ score is analyzed. Thus, here, the outcome variable is a child’s IQ score and the treatment variable is the months of breastfeeding by the mother. Additional control variables are: sex and age of the child, and the mother’s age, the mother’s years of formal education, whether the mother is married and the race of the mother. For further details of this example, see section 4.2 in Oster (2019). For ease of reference, let us note that we are working with the model reported in the first row of Table 3 in Oster (2019) and the first block of 4 rows in Table 2 in Basu (2022).
Using the names of variables in the data set, we have: * short
regression: iq_std ~ BF_months + sex + age
* intermediate
regression:
iq_std ~ BF_months + sex + age + income + motherAge + motherEDU + mom_married + race
.
Let us use the collect_par()
function to collect
parameters from the short, intermediate and auxiliary regressions. It is
important to note that other_parameters
option in this
function should refer to a subset of control
. If the
researcher fails to ensure this, the collect_par()
function
will throw an error.
parameters <- bate::collect_par(data=NLSY_IQ,
outcome="iq_std",
treatment="BF_months",
control=c("age","sex","income","motherAge","motherEDU","mom_married","race"),
other_regressors = c("sex","age"))
Let us see the parameters by looking at the object
parameters
that we used to store the output of the
collect_par()
function.
(parameters)
#> beta0 R0 betatilde Rtilde sigmay sigmax taux
#> BF_months 0.04447926 0.04465201 0.01740748 0.255621 0.9900242 4.629618 18.99883
Our next task is to choose the dimensions of the bounded box over which we want the bias computation to be carried out. It is here that the researcher needs to use domain knowledge to set limits over which δ and Rmax can run.
# Upper bound of Rmax
Rhigh <- 0.61
# Lower bound of delta
deltalow <- 0.01
# Upper bound of delta
deltahigh <- 0.99
# step size to construct grid
e <- 0.01
Note that while setting the dimensions of the bounded box, we have
not chosen a value for Rlow
(lower bound for Rmax).
This is because Rlow
is chosen by default to be equal to
parameters$Rtilde
(the R-squared in the intermediate
regression).
Let us see the division of the bounded box into the URR (unique real
root) and NURR (nonunique real root) regions using the
urrplot()
function.
bate::urrplot(parameters = parameters, deltalow = deltalow,
deltahigh = deltahigh, Rlow = parameters$Rtilde,
Rhigh = 0.61, e=0.01)
Now we can use the ovbias()
function to compute the
empirical distribution of omitted variable bias and BATE. Note that this
step make take a few minutes, depending on the dimensions of the box and
the size of e
, to complete itself. As the function works,
it will print a message in the console informing the user of the
progress of the computation. Here, I have suppressed these messages.
OVB <- bate::ovbias(
parameters = parameters,
deltalow=deltalow,
deltahigh=deltahigh,
Rhigh=Rhigh,
e=e)
Once the computation is completed, we can see the quantiles of omitted variable bias
and quantiles of the BATE (computed over the bounded box we chose above).
We can create a contour plot of BATE over the bounded box using the
cplotbias()
function.
We can create the histogram and density plot of the β*, the bias-adjusted
treatment effect using the dplotbate()
function.
dplotbate(OVB$Data)
#> Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
#> ℹ Please use `after_stat(density)` instead.
#> ℹ The deprecated feature was likely used in the bate package.
#> Please report the issue to the authors.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
Let us compare our results with the methods proposed in Oster (2019). The methodology proposed in Oster (2019) relies on computing two things: (a) identified sets, and (b) δ*.
Let us compute the identified sets.
bate::osterbds(parameters = parameters, Rmax=0.61)
#> Discriminant Interval1 Interval2
#> "20.3210250174806" "[0.0174,0.3752]" "[-0.0337,0.0174]"
The output from the above function contains three things: the value of the discriminant of the quadratic equation that is solved to compute the identified sets, and the two identified sets. The identified set is the interval formed by β̃ (treatment effect in the intermediate regression) and β* = β̃ − ν, where ν is a root of a quadratic equation.
It has been shown in Proposition 5 in Basu (2022) that the discriminant of this quadratic will always be positive. Hence, there will be two real roots of the quadratic. Hence, there will be two identified sets, instead of a unique identified set. That is why we see two identified sets in the result above.
Let us also compute δ*, the value of δ that is associated with a given
value of Rmax
(in this case Rmax=0.61
) such
that the treatment effect is zero.
bate::osterdelstar(parameters = parameters, Rmax=0.61)
#> deltastar discontinuity slope
#> "0.36619461884058" "FALSE" "Negative"
In addition to the value of δ*, the output has two other things:
discontinuity
: this tells us whether the interval
formed by R̃ and 1 contains a point of discontinuity; if it is
FALSE
, then the interval does not contain the point of
discontinuity; if it is TRUE
, then the interval contains
the point of discontinuity and the analysis of Oster (2019) should be
avoided; for more details see Section 5.2 in Basu (2022).
slope
: this gives the slope of the graph of the
function δ = f(Rmax);
the slope can be either positive
or negative
and helps in interpreting the meaning of δ*.
The value of δ*
printed above is just one value picked up from the function δ = f(Rmax),
for the specific choice of Rmax.
To see the graph of this function, we can use the function
delfplot()
.
The red vertical dashed line in the plot identifies the value of Rmax that corresponds to δ = 1. In this example, this value of Rmax is 0.38. This means that if a researcher uses any value of Rmax that is greater than 0.38, she will get a value of δ* that is less than 1, and if she uses a value of Rmax that is less than 0.38, she will get a value of δ* that is greater than 1.
We could have carried out the same analysis using the
ovbias_par()
function.
OVB.par <- ovbias_par(
data=NLSY_IQ,
outcome="iq_std",
treatment="BF_months",
control=c("age","sex","income","motherAge","motherEDU","mom_married","race"),
other_regressors = c("sex","age"),
deltalow=deltalow,
deltahigh=deltahigh,
Rhigh=Rhigh,
e=e)
We can now see the quantiles of omitted variable bias
and quantiles of the BATE (computed over the bounded box we chose above).
We could have also carried out the same analysis using the
ovbias_lm()
function. To use this function, we need to
estimate the short regression
and the intermediate regression
reg_col2 <- lm(
iq_std ~ BF_months + factor(age) + sex +
income + motherAge + motherEDU + mom_married +
factor(race),
data = NLSY_IQ
)
and the auxiliary regression
reg_aux <- lm(
BF_months ~ factor(age) + sex +
income + motherAge + motherEDU + mom_married +
factor(race),
data = NLSY_IQ
)
and then call the ovbias_lm()
function and provide the
three lm
objects created above
OVB.lm <- ovbias_lm(
lm_shrt = reg_col1,
lm_int = reg_col2,
lm_aux = reg_aux,
deltalow=deltalow,
deltahigh=deltahigh,
Rhigh=Rhigh,
e=e)
We can now see the quantiles of omitted variable bias
and quantiles of the BATE (computed over the bounded box we chose above).
Basu, D. (2022). “Bounds for Bias-Adjusted Treatment Effect in Linear Econometric Models.” <arXiv:2203.12431>
Cinelli, C. and Hazlett, C. (2020). Making Sense of Sensitivity: Extending Omitted Variable Bias. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(1):39–67. https://doi.org/10.1111/rssb.12348
Oster, E. (2019). “Unobservable Selection and Coefficient Stability: Theory and Evidence.” Journal of Business & Economic Statistics, 37:2, 187-204, https://doi.org/10.1080/07350015.2016.1227711