estimatr is a package in R dedicated to providing fast estimators for social scientists. Estimators are statistical methods for estimating quantities of interest like treatment effects or regression parameters. Many of the estimators included with the R programming language or popular R packages are slow and have default settings that lead to statistically inappropriate estimates. Certain estimators that reflect cutting-edge advances in statistics are not yet implemented in R packages for convenient use. estimatr is designed to solve these problems.
The current estimators we provide are:
lm_robust()
- for fitting linear models with heteroskedasticity/cluster robust standard errors
lm_lin()
- a wrapper for lm_robust()
to simplify interacting centered pre-treatment covariates with a treatment variable
difference_in_means()
- for estimating differences in means with appropriate standard errors for simple, cluster randomized, block randomized, matched-pair designs and more
horvitz_thompson()
- for estimating average treatment effects taking into consideration treatment probabilities or sampling probabilities for simple and cluster randomized designsTo demonstrate basic usage of each of the estimators, I first create some sample data.
library(estimatr)
# Example dataset to be used throughout built using fabricatr and randomizr
dat <- fabricatr::fabricate(
N = 100, # sample size
x = runif(N, 0, 1), # pre-treatment covariate
y0 = rnorm(N, mean = x), # control potential outcome
y1 = y0 + 0.35, # treatment potential outcome
z = randomizr::complete_ra(N), # complete random assignment to treatment
y = ifelse(z, y1, y0), # observed outcome
# We will also consider clustered data
clust = sample(rep(letters[1:20], each = 5)),
z_clust = randomizr::cluster_ra(clust),
y_clust = ifelse(z_clust, y1, y0)
)
head(dat)
ID | x | y0 | y1 | z | y | clust | z_clust | y_clust |
---|---|---|---|---|---|---|---|---|
001 | 0.91 | 1.24 | 1.6 | 0 | 1.24 | k | 1 | 1.59 |
002 | 0.94 | 0.15 | 0.5 | 0 | 0.15 | m | 0 | 0.15 |
003 | 0.29 | 1.86 | 2.2 | 0 | 1.86 | i | 0 | 1.86 |
004 | 0.83 | 1.47 | 1.8 | 1 | 1.82 | r | 1 | 1.82 |
005 | 0.64 | 0.73 | 1.1 | 1 | 1.08 | c | 0 | 0.73 |
006 | 0.52 | 0.80 | 1.1 | 0 | 0.80 | s | 0 | 0.80 |
Results presented in this vignette are rounded to two digits for visual display.
lm_robust
The estimatr
package provides a function to quickly fit linear models with the most common variance estimators and degrees of freedom corrections used in social science. You can easily return heteroskedastic standard errors, clustered standard errors, and classical standard errors.
Usage largely mimics lm()
, although it defaults to using Eicker-Huber-White robust standard errors, specifically ‘HC2’ standard errors:
lm_robust(y ~ z + x, data = dat)
Estimate | Std. Error | Pr(>|t|) | ci_lower | ci_upper | df | |
---|---|---|---|---|---|---|
(Intercept) | -0.19 | 0.21 | 0.37 | -0.60 | 0.22 | 97 |
z | 0.23 | 0.19 | 0.21 | -0.14 | 0.60 | 97 |
x | 1.42 | 0.29 | 0.00 | 0.85 | 1.99 | 97 |
It is straightforward to do cluster-robust inference, by passing the name of your cluster variable to the clusters =
argument. Note that lm_robust()
is much quicker if your cluster variable is a factor!
lm_robust(y_clust ~ z_clust + x,
data = dat,
clusters = clust)
Estimate | Std. Error | Pr(>|t|) | ci_lower | ci_upper | df | |
---|---|---|---|---|---|---|
(Intercept) | -0.48 | 0.22 | 0.05 | -0.97 | 0.0 | 12 |
z_clust | 0.81 | 0.17 | 0.00 | 0.45 | 1.2 | 18 |
x | 1.43 | 0.34 | 0.00 | 0.72 | 2.1 | 17 |
The default variance estimator with clusters is dubbed ‘CR2’ because it is analogous to ‘HC2’ for the clustered case, and utilizes recent advances proposed by (Pustejovsky and Tipton 2016) to correct hypotheses tests for small samples and work with commonly specified fixed effects and weights.
Researchers can also replicate Stata’s clustered standard errors by using the se_type =
argument:
lm_robust(y_clust ~ z_clust + x,
data = dat,
clusters = clust,
se_type = 'stata')
Estimate | Std. Error | Pr(>|t|) | ci_lower | ci_upper | df | |
---|---|---|---|---|---|---|
(Intercept) | -0.48 | 0.22 | 0.04 | -0.95 | -0.02 | 19 |
z_clust | 0.81 | 0.17 | 0.00 | 0.46 | 1.16 | 19 |
x | 1.43 | 0.33 | 0.00 | 0.73 | 2.13 | 19 |
lm_lin
Following the critique by (Freedman 2008) that pre-treatment covariate adjustment biases estimates of average treatment effects, (Lin and others 2013) proposed an alternative estimator that would reduce this bias and improve precision. The (Lin and others 2013) estimator suggests centering all pre-treatment covariates and interacting them with the treatment variable. To facilitate this, we provide a wrapper that can do that process for you.
lm_lin(y ~ z,
covariates = ~ x,
data = dat)
Estimate | Std. Error | Pr(>|t|) | ci_lower | ci_upper | df | |
---|---|---|---|---|---|---|
(Intercept) | 0.55 | 0.15 | 0.00 | 0.25 | 0.84 | 96 |
z | 0.24 | 0.19 | 0.21 | -0.13 | 0.61 | 96 |
x_bar | 1.72 | 0.47 | 0.00 | 0.79 | 2.65 | 96 |
z:x_bar | -0.58 | 0.58 | 0.32 | -1.73 | 0.57 | 96 |
Furthermore, it is just a wrapper for lm_robust()
, so all arguments that work for lm_robust()
work here.
difference_in_means
While estimating differences in means may seem straightforward, we provide a function that appropriately adjusts estimates for blocking and clustering to match the current state of knowledge in social science methodology. Usage is similar to usage in regression functions.
# Simple version
difference_in_means(y ~ z,
data = dat)
Estimate | Std. Error | Pr(>|t|) | ci_lower | ci_upper | df | |
---|---|---|---|---|---|---|
z | 0.16 | 0.2 | 0.44 | -0.25 | 0.56 | 90 |
# Clustered version
difference_in_means(y_clust ~ z_clust,
data = dat,
clusters = clust)
Estimate | Std. Error | Pr(>|t|) | ci_lower | ci_upper | df | |
---|---|---|---|---|---|---|
z_clust | 0.82 | 0.17 | 0 | 0.47 | 1.2 | 54 |
horvitz_thompson
Horvitz-Thompson estimators are useful when estimating treatment effects without bias when there is a clustered design with unequal size clusters or when the treatment assignment process is arbitrarily complex. Horvitz-Thompson estimators require information about the treatment (and control) probabilities for each unit, and the joint treatment and control probabilities of every pair of units. For simpler designs, passing your randomization scheme using randomizr::declare_ra()
is the easiest path forward.
# Complete random assignment declaration
crs_decl <- randomizr::declare_ra(N = nrow(dat),
prob = 0.5,
simple = FALSE)
horvitz_thompson(y ~ z,
data = dat,
declaration = crs_decl)
# Clustered random assignment declaration
crs_clust_decl <- randomizr::declare_ra(
N = nrow(dat),
clust_var = dat$clust,
prob = 0.5,
simple = FALSE
)
horvitz_thompson(y_clust ~ z_clust,
data = dat,
declaration = crs_clust_decl)
Estimate | Std. Error | Pr(>|t|) | ci_lower | ci_upper | df | |
---|---|---|---|---|---|---|
z_clust | 0.82 | 0.13 | 0 | 0.55 | 1.1 | 98 |
There are two main variance estimators we provide, a conservative estimator using Young’s inequality (se_type = "youngs"
), described by (Aronow and Middleton 2013), and an estimator that assumes constant treatment effects (se_type = "constant"
). The default is the conservative estimator.
horvitz_thompson(
y_clust ~ z_clust,
data = dat,
declaration = crs_clust_decl,
se_type = "constant"
)
Estimate | Std. Error | Pr(>|t|) | ci_lower | ci_upper | df | |
---|---|---|---|---|---|---|
z_clust | 0.82 | 0.17 | 0 | 0.48 | 1.1 | 98 |
You can also build the condition probability matrix (condition_prob_mat =
) that horvitz_thompson()
needs yourself. This matrix is an 2N by 2N matrix, where the first N rows and N columns (the upper-left corner) are the joint probability of each unit being in the control condition with every other unit being in the control condition. The upper right N by N matrix is the joint probability of each unit being in the control condition with every other unit being in the treated condition. The second N rows are the same, with the joint probability of each unit being in the treated condition first with each unit being in the control (first N columns) and then with each unit being in the treated as well (second N columns).
We also provide helper functions to build this matrix from any given permutation of treatments (`condition.
# Generate 500 arbitrary permutations
permutes <- matrix(rbinom(nrow(dat) * 500,
size = 1,
prob = 0.5),
nrow = nrow(dat))
dim(permutes)
permutes[1:5, 1:2]
# Get condition probability matrix
arb_pr_mat <- permutations_to_condition_pr_mat(permutes)
horvitz_thompson(y ~ z,
data = dat,
condition_pr_mat = arb_pr_mat)
Estimate | Std. Error | Pr(>|t|) | ci_lower | ci_upper | df | |
---|---|---|---|---|---|---|
z | 0.16 | 0.24 | 0.52 | -0.32 | 0.63 | 98 |
Aronow, Peter M, and Joel A Middleton. 2013. “A Class of Unbiased Estimators of the Average Treatment Effect in Randomized Experiments.” Journal of Causal Inference 1 (1): 135–54. https://doi.org/10.1515/jci-2012-0009.
Freedman, David A. 2008. “On Regression Adjustments in Experiments with Several Treatments.” The Annals of Applied Statistics. JSTOR, 176–96. https://doi.org/10.1214/07-AOAS143.
Lin, Winston, and others. 2013. “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique.” The Annals of Applied Statistics 7 (1). Institute of Mathematical Statistics: 295–318. https://doi.org/10.1214/12-AOAS583.
Pustejovsky, James E, and Elizabeth Tipton. 2016. “Small Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models.” Journal of Business & Economic Statistics. Taylor & Francis. https://doi.org/10.1080/07350015.2016.1247004.