This function fits a linear model, provides a variety of options for robust standard errors, and conducts coefficient tests

lm_robust(formula, data, weights, subset, clusters, se_type = NULL,
ci = TRUE, alpha = 0.05, return_vcov = TRUE, try_cholesky = FALSE)

Arguments

formula

an object of class formula, as in lm

data

A data.frame

weights

the bare (unquoted) name of the weights variable in the supplied data

subset

An optional bare (unquoted) expression specifying a subset of observations to be used

clusters

An optional bare (unquoted) name of the variable that corresponds to the clusters in the data

se_type

The sort of standard error sought. If clusters is not specified, the options are "HC0", "HC1" (or "stata", the equivalent), "HC2" (default), "HC3", or "classical". If clusters is specified, the options are "CR0", "CR2" (default), or "stata".

ci

logical. Whether to compute and return p-values and confidence intervals, TRUE by default.

alpha

The significance level, 0.05 by default.

return_vcov

logical. Whether to return the variance-covariance matrix for later usage, TRUE by default.

try_cholesky

logical. Whether to try using a Cholesky decomposition to solve least squares instead of a QR decomposition, FALSE by default. Using a Cholesky decomposition may result in speed gains, but it should only be used if users are sure their model is full-rank (i.e., there is no perfect multi-collinearity).

Value

An object of class "lm_robust".

The post-estimation functions summary and tidy return results in a data.frame. To extract useful data from the return value, you can use these data frames, use the resulting list directly, or use the generic accessor functions coef, vcov, confint, and predict. Marginal effects and their uncertainty can be obtained by passing this object to margins from the margins package.
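As a minimal sketch of the accessors (assuming estimatr is installed; mtcars is used here purely for illustration, not from the examples below):

```r
library(estimatr)

# Hypothetical fitted model for illustration
fit <- lm_robust(mpg ~ wt + hp, data = mtcars)

coef(fit)     # named vector of coefficient estimates
vcov(fit)     # robust variance-covariance matrix (requires return_vcov = TRUE)
confint(fit)  # confidence intervals at the fitted alpha
```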

Users who want to print the results in TeX or HTML can use the extract function and the texreg package.
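A minimal sketch of that workflow (assuming texreg is installed; screenreg is texreg's console-oriented printer, shown here as a stand-in for texreg or htmlreg):

```r
library(estimatr)
library(texreg)

# Illustrative model; extract() converts it for texreg's printers
fit <- lm_robust(mpg ~ wt, data = mtcars)
screenreg(extract(fit))  # texreg(...) for TeX, htmlreg(...) for HTML
```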

An object of class "lm_robust" is a list containing at least the following components:

coefficients

the estimated coefficients

se

the estimated standard errors

df

the estimated degrees of freedom

p

the p-values from a two-sided t-test using coefficients, se, and df

ci_lower

the lower bound of the 1 - alpha percent confidence interval

ci_upper

the upper bound of the 1 - alpha percent confidence interval

coefficient_name

a character vector of coefficient names

alpha

the significance level specified by the user

res_var

the residual variance

N

the number of observations used

k

the number of columns in the design matrix (includes linearly dependent columns!)

rank

the rank of the fitted model

vcov

the fitted variance covariance matrix

r.squared

The $$R^2$$, $$R^2 = 1 - Sum(e[i]^2) / Sum((y[i] - y^*)^2),$$ where $$y^*$$ is the mean of $$y[i]$$ if there is an intercept and zero otherwise, and $$e[i]$$ is the ith residual.

adj.r.squared

The $$R^2$$ but penalized for having more parameters, rank

weighted

whether or not weights were applied

call

the original function call

We also return terms and contrasts, used by predict.
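Because terms and contrasts are stored on the object, predict works on new data much as it does for lm fits. A sketch (the model and new observations are hypothetical):

```r
library(estimatr)

# Model with a factor term; its contrasts are stored in the fit
fit <- lm_robust(mpg ~ wt + factor(cyl), data = mtcars)

# Factor levels in newdata are matched using the stored terms/contrasts
newdat <- data.frame(wt = c(2.5, 3.0), cyl = c(4, 6))
predict(fit, newdata = newdat)
```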

Details

This function performs linear regression and provides a variety of standard errors. It takes a formula and data much in the same way as lm does, and all auxiliary variables, such as clusters and weights, can be passed either as quoted names of columns, as bare column names, or as a self-contained vector. Examples of usage can be seen below and in the Getting Started vignette.

The mathematical notes in this vignette specify the exact estimators used by this function. The default variance estimators have been chosen largely in accordance with the procedures in this manual. The default for the case without clusters is the HC2 estimator and the default with clusters is the analogous CR2 estimator. Users can easily replicate Stata standard errors in the clustered or non-clustered case by setting se_type = "stata".

The function estimates the coefficients and standard errors in C++, using the RcppEigen package. By default, we estimate the coefficients using a column-pivoting QR decomposition from the Eigen C++ library, although users can set try_cholesky = TRUE to use a Cholesky decomposition instead. This will likely be faster, but the algorithm does not reliably detect linear dependencies in the model and may fail silently if they exist.
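For example, a model known to be full-rank can be fit with the Cholesky path; this is only a sketch, and the option should be avoided whenever perfect multi-collinearity is possible:

```r
library(estimatr)

# Safe here because wt and hp are not linearly dependent
fit_chol <- lm_robust(mpg ~ wt + hp, data = mtcars, try_cholesky = TRUE)
coef(fit_chol)
```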

References

Abadie, Alberto, Susan Athey, Guido W Imbens, and Jeffrey Wooldridge. 2017. "A Class of Unbiased Estimators of the Average Treatment Effect in Randomized Experiments." arXiv Pre-Print. https://arxiv.org/abs/1710.02926v2.

Bell, Robert M, and Daniel F McCaffrey. 2002. "Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples." Survey Methodology 28 (2): 169-82.

MacKinnon, James, and Halbert White. 1985. "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties." Journal of Econometrics 29 (3): 305-25. https://doi.org/10.1016/0304-4076(85)90158-7.

Pustejovsky, James E, and Elizabeth Tipton. 2016. "Small Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models." Journal of Business & Economic Statistics. Taylor & Francis. https://doi.org/10.1080/07350015.2016.1247004.

Samii, Cyrus, and Peter M Aronow. 2012. "On Equivalencies Between Design-Based and Regression-Based Variance Estimators for Randomized Experiments." Statistics and Probability Letters 82 (2). https://doi.org/10.1016/j.spl.2011.10.024.

Examples

library(fabricatr)
dat <- fabricate(
  N = 40,
  y = rpois(N, lambda = 4),
  x = rnorm(N),
  z = rbinom(N, 1, prob = 0.4)
)

# Default variance estimator is HC2 robust standard errors
lmro <- lm_robust(y ~ x + z, data = dat)

# Can tidy() the data in to a data.frame
tidy(lmro)
#>   coefficient_name coefficients        se            p   ci_lower  ci_upper df
#> 1      (Intercept)   3.67846151 0.3826625 1.331206e-11  2.9031137 4.4538094 37
#> 2                x   0.01582338 0.3657818 9.657276e-01 -0.7253210 0.7569677 37
#> 3                z   0.58982078 0.6245145 3.510682e-01 -0.6755657 1.8552073 37
#>   outcome
#> 1       y
#> 2       y
#> 3       y

# Can use summary() to get more statistics
summary(lmro)
#>
#> Call:
#> lm_robust(formula = y ~ x + z, data = dat)
#>
#> Standard error type =  HC2
#>
#> Coefficients:
#>             Estimate Std. Error  Pr(>|t|) CI Lower CI Upper DF
#> (Intercept)  3.67846     0.3827 1.331e-11   2.9031    4.454 37
#> x            0.01582     0.3658 9.657e-01  -0.7253    0.757 37
#> z            0.58982     0.6245 3.511e-01  -0.6756    1.855 37
#>
#> Multiple R-squared:  0.02422 ,	Adjusted R-squared:  -0.02852
#> F-statistic: 0.4592 on 2 and 37 DF,  p-value: 0.6353

# Can also get coefficients three ways
lmro$coefficients
#> (Intercept)           x           z
#>  3.67846151  0.01582338  0.58982078
coef(lmro)
#> (Intercept)           x           z
#>  3.67846151  0.01582338  0.58982078
tidy(lmro)$coefficients
#> [1] 3.67846151 0.01582338 0.58982078

# Can also get confidence intervals from object or with new 1 - alpha
lmro$ci_lower
#> [1]  2.9031137 -0.7253210 -0.6755657
confint(lmro, level = 0.8)
#>                   10 %      90 %
#> (Intercept)  3.1791427 4.1777803
#> x           -0.4614686 0.4931154
#> z           -0.2250796 1.4047212

# Can recover classical standard errors
lmclassic <- lm_robust(y ~ x + z, data = dat, se_type = "classical")
tidy(lmclassic)
#>   coefficient_name coefficients        se            p   ci_lower  ci_upper df
#> 1      (Intercept)   3.67846151 0.3765766 8.661954e-12  2.9154448 4.4414782 37
#> 2                x   0.01582338 0.3044257 9.588260e-01 -0.6010018 0.6326485 37
#> 3                z   0.58982078 0.6160405 3.445607e-01 -0.6583959 1.8380374 37
#>   outcome
#> 1       y
#> 2       y
#> 3       y

# Can easily match Stata's robust standard errors
lmstata <- lm_robust(y ~ x + z, data = dat, se_type = "stata")
tidy(lmstata)
#>   coefficient_name coefficients        se            p   ci_lower ci_upper df
#> 1      (Intercept)   3.67846151 0.3810271 1.187262e-11  2.9064274 4.450496 37
#> 2                x   0.01582338 0.3529490 9.644824e-01 -0.6993193 0.730966 37
#> 3                z   0.58982078 0.6220572 3.491911e-01 -0.6705869 1.850228 37
#>   outcome
#> 1       y
#> 2       y
#> 3       y

# Easy to specify clusters for cluster-robust inference
dat$clusterID <- sample(1:10, size = 40, replace = TRUE)

lmclust <- lm_robust(y ~ x + z, data = dat, clusters = clusterID)
tidy(lmclust)
#>   coefficient_name coefficients        se            p  ci_lower ci_upper
#> 1      (Intercept)   3.67846151 0.3667644 2.540429e-05  2.806069 4.550854
#> 2                x   0.01582338 0.4242528 9.716757e-01 -1.071038 1.102685
#> 3                z   0.58982078 0.6651378 4.094710e-01 -1.039309 2.218951
#>         df outcome
#> 1 6.802426       y
#> 2 5.057445       y
#> 3 5.975842       y
# Can also match Stata's clustered standard errors
lm_robust(
  y ~ x + z,
  data = dat,
  clusters = clusterID,
  se_type = "stata"
)
#>               Estimate Std. Error     Pr(>|t|)   CI Lower  CI Upper DF
#> (Intercept) 3.67846151  0.3650948 3.361353e-06  2.8525597 4.5043633  9
#> x           0.01582338  0.4053142 9.697110e-01 -0.9010611 0.9327078  9
#> z           0.58982078  0.6425863 3.826179e-01 -0.8638104 2.0434520  9
# Works just as lm does with functions in the formula
dat$blockID <- rep(c("A", "B", "C", "D"), each = 10)
lm_robust(y ~ x + z + factor(blockID), data = dat)
#>                     Estimate Std. Error     Pr(>|t|)   CI Lower  CI Upper DF
#> (Intercept)       3.28425607  0.4883760 1.001618e-07  2.2917566 4.2767556 34
#> x                 0.02454756  0.3350361 9.420219e-01 -0.6563278 0.7054229 34
#> z                 0.55472639  0.7157915 4.437063e-01 -0.8999370 2.0093898 34
#> factor(blockID)B  1.09053563  0.9200820 2.441333e-01 -0.7792959 2.9603672 34
#> factor(blockID)C -0.06185656  0.8560830 9.428221e-01 -1.8016265 1.6779134 34
#> factor(blockID)D  0.59999985  0.8081074 4.628984e-01 -1.0422720 2.2422716 34

# Weights are also easily specified
dat$w <- runif(40)

lm_robust(
  y ~ x + z,
  data = dat,
  weights = w,
  clusters = clusterID
)
#>                 Estimate Std. Error     Pr(>|t|)  CI Lower CI Upper       DF
#> (Intercept)  3.547433364  0.4040461 0.0002226048  2.529323 4.565544 5.357866
#> x           -0.002655767  0.3956012 0.9949295325 -1.052716 1.047404 4.522808
#> z            0.696123507  0.7074789 0.3711587223 -1.133237 2.525484 4.904283
# Subsetting works just as in lm()
lm_robust(y ~ x, data = dat, subset = z == 1)
#>              Estimate Std. Error     Pr(>|t|)  CI Lower CI Upper DF
#> (Intercept) 4.2882796  0.5267980 1.847702e-06  3.150202 5.426357 13
#> x           0.2116763  0.6905727 7.640608e-01 -1.280215 1.703568 13
# One can also choose to set the significance level for different CIs
lm_robust(y ~ x + z, data = dat, alpha = 0.1)
#>               Estimate Std. Error     Pr(>|t|)   CI Lower  CI Upper DF
#> (Intercept) 3.67846151  0.3826625 1.331206e-11  3.0328741 4.3240490 37
#> x           0.01582338  0.3657818 9.657276e-01 -0.6012848 0.6329315 37
#> z           0.58982078  0.6245145 3.510682e-01 -0.4637936 1.6434351 37
# NOT RUN {
# Can also use 'margins' package if you have it installed to get
# marginal effects
library(margins)
lmrout <- lm_robust(y ~ x + z, data = dat)
summary(margins(lmrout))

# Can output results using 'texreg'
library(texreg)
texregobj <- extract(lmrout)
# }