The Backdoor Criterion and Basics of Regression in R
Welcome!
Introduction
Welcome to our fourth tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.
During this week's lecture you reviewed bivariate and multiple linear regressions. You also learned how Directed Acyclic Graphs (DAGs) can be leveraged to obtain causal estimates.
In this lab session we will:
- Use the ggdag and dagitty packages to assess your modeling strategy
- Review how to run regression models using R
- Illustrate omitted variable and collider bias
Packages
# These are the libraries we will use today. Make sure to install them in your console in case you have not done so previously.
set.seed(42) #for consistent results in randomization
library(wooldridge) # To get our example's dataset
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(ggplot2) # To visualize data (this package is also loaded by library(tidyverse))
library(ggdag) # To dagify and plot our DAG objects in R
library(dagitty) # To perform analysis in our DAG objects in R
library(stargazer) # To render nicer regression output
data("wage1") # calls the wage1 dataset from the woorldridge package
Working with DAGs in R
Last week we learned about the general syntax of the ggdag package:
- We created dagified objects with ggdag::dagify()
- We plotted our DAGs with ggdag::ggdag()
- We discussed how to specify the coordinates of our nodes with a coordinate list
Today, we will learn how the ggdag and dagitty packages can help us illustrate our paths and adjustment sets to fulfill the backdoor criterion.
Let's take one of the DAGs from our review slides:
coord_dag <- list(
x = c(d = 0, p = 0, b = 1, a = 1 , c = 2, y = 2),
y = c(d = 0, p = 2, b = 1, a = -1, c = 2, y = 0)
)
our_dag <- ggdag::dagify(d ~ p + a, # p and a pointing at d
b ~ p + c, # p and c pointing at b
y ~ d + a + c, # d, a, and c pointing at y
coords = coord_dag, # our coordinates from the list up there
exposure = "d", # we declare out exposure variable
outcome = "y") # we declare out outcome variable
ggdag::ggdag(our_dag) +
theme_dag() # equivalent to theme_void()
Learning about our paths and what adjustments we need
As you have seen, when we dagify a DAG in R, a dagitty object is created. These objects tell R that we are dealing with DAGs.
This is very important because, in addition to plotting them, we can do analyses on the DAG objects. A package that complements ggdag is the dagitty package.
Today, we will focus on two functions from the dagitty package:
- dagitty::paths(): returns a list with two components: paths, which gives the actual paths, and open, which shows whether each path is open (d-connected) or closed (d-separated)
- dagitty::adjustmentSets(): lists the sets of covariates that would allow for unbiased estimation of the causal effect, assuming that the causal graph is correct
We just need to input our DAG object.
Paths
Let's see what the output of the dagitty::paths() function looks like:
dagitty::paths(our_dag)
## $paths
## [1] "d -> y" "d <- a -> y" "d <- p -> b <- c -> y"
##
## $open
## [1] TRUE TRUE FALSE
We see under $paths the three paths we declared during the manual exercise:
- d -> y
- d <- a -> y
- d <- p -> b <- c -> y
Additionally, $open tells us whether each path is open. In this case, the second path is the only open non-causal path, so we would need to condition on a to close it.
We can also use ggdag to present the open paths visually with the ggdag_paths() function, as such:
ggdag::ggdag_paths(our_dag) +
theme_dag()
Covariate adjustment
In addition to listing all the paths and sorting out the backdoors manually, we can use the dagitty::adjustmentSets() function.
With this function, we just need to input our DAG object and it will return the different sets of adjustments.
dagitty::adjustmentSets(our_dag)
## { a }
In this DAG there is only one adjustment set: we need to control for a.
We can also use ggdag to present the adjustment sets visually with the ggdag_adjustment_set() function, as such.
Also, do not forget to set the argument shadow = TRUE, so that the arrows from the adjusted nodes are included.
ggdag::ggdag_adjustment_set(our_dag, shadow = T) +
theme_dag()
If you want to learn more about DAGs in R
- ggdag documentation: https://ggdag.malco.io/
- dagitty reference manual: https://cran.r-project.org/web/packages/dagitty/dagitty.pdf
- dagitty vignette: https://cran.r-project.org/web/packages/dagitty/vignettes/dagitty4semusers.html
NOW LET'S MOVE TO REGRESSION
Introduction to Regression
Linear regression is largely used to predict the value of an outcome variable based on one or more input explanatory variables. As we previously discussed, regression addresses a simple mechanical problem, namely, what is our best guess of y given an observed x.
- Regression can be utilized, without thinking about causes, as a predictive or summarizing tool
- It would not be appropriate to give causal interpretations to any \(\beta\) unless we establish the fulfillment of certain assumptions
Bivariate regression
In bivariate regression, we are modeling a variable \(y\) as a mathematical function of one variable \(x\). We can generalize this in a mathematical equation as such:
\[y = \beta_{0} + \beta_{1}x + ϵ\]
Multiple linear regression
In multiple linear regression, we are modeling a variable \(y\) as a mathematical function of multiple variables \((x, z, m)\). We can generalize this in a mathematical equation as such:
\[y = \beta_{0} + \beta_{1}x + \beta_{2}z + \beta_{3}m + ϵ\]
Exploratory questions
Let's illustrate this with an example
We will use the wage1 dataset from the wooldridge package. These are data from the 1976 Current Population Survey, used by Jeffrey M. Wooldridge for pedagogical purposes in his book on introductory econometrics.
If you want to check the contents of the wage1 data frame, you can type ?wage1 in your console.
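Before modeling, it never hurts to glance at the data. Here is a minimal sketch that looks only at the three columns we will use below (wage, educ, and female):
dplyr::glimpse(wage1[, c("wage", "educ", "female")]) # variable types and first values
summary(wage1$wage) # distribution of hourly wages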
Visualizing
With regression we can answer EXPLORATORY QUESTIONS. For example:
What is the relationship between education and respondents' salaries?
We can start by exploring the relationship visually with our newly attained ggplot2 skills:
ggplot(wage1, aes(x = educ, y = wage)) +
geom_point(color = "grey60") +
geom_smooth(method = "lm", se = F, color = "#CC0055") +
theme_minimal() +
labs(x = "Years of education",
y = "Hourly wage (USD)")
The lm() function
This question can be formalized mathematically as:
\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + ϵ\]
Our interest here would be to build a model that predicts the hourly wage of a respondent (our outcome variable) using the years of education (our explanatory variable). Fortunately for us, R provides us with a very intuitive syntax to model regressions.
The general syntax for running a regression model in R is the following:
your_model_biv <- lm(outcome_variable ~ explanatory_variable, data = your_dataset) # for a bivariate regression
your_model_mult <- lm(outcome_variable ~ explanatory_variable_1 + explanatory_variable_2, data = your_dataset) # for a multiple regression
Now let's create our own model and save it into the model_1 object, based on the bivariate regression we specified above, in which wage is our outcome variable, educ is our explanatory variable, and our data come from the wage1 object:
model_1 <- lm(wage ~ educ, data = wage1)
summary() and broom::tidy()
We have created an object that contains the coefficients, standard errors, and further information from our model. To see the estimates, you can use the base R function summary(). This function is very useful when you want to print your results in your console.
Alternatively, you can use the tidy() function from the broom package. This function constructs a data frame that summarizes the model's statistical findings. You can see what else you can do with broom by running vignette("broom"). The broom::tidy() function is useful when you want to store the values for future use (e.g., visualizing them).
Let's try both options in the console up there. You just need to copy this code below the model_1 code.
summary(model_1)
broom::tidy(model_1)
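One reason broom::tidy() is handy: because it returns a data frame, we can pipe its contents straight into ggplot2. A minimal sketch (the object name model_1_tidy is just illustrative):
model_1_tidy <- broom::tidy(model_1, conf.int = TRUE) # estimates plus 95% confidence intervals
ggplot(model_1_tidy, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  theme_minimal() +
  labs(x = "", y = "Estimate (95% CI)")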
Exercise
\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + ϵ\]
| | Dependent variable: Hourly wage |
|---|---|
| Years of education | 0.541*** (0.053) |
| Constant | -0.905 (0.685) |
| Observations | 526 |
| R2 | 0.165 |
| Adjusted R2 | 0.163 |
| Residual Std. Error | 3.378 (df = 524) |
| F Statistic | 103.363*** (df = 1; 524) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
How would you interpret the results of our model_1?
- What does the constant mean?
- What does the educ coefficient mean?
- Do these coefficients carry any causal meaning?
Adding more nuance to our models
As we have discussed in previous sessions, we live in a very complex world. It is very likely that our exploration of the relationship between education and respondents' salaries is open to multiple sources of bias.
Looking back at the US in 1976, can you think of possible variables that could be in the mix?
How about the sex or the ethnicity of a worker?
Let's explore this visually
What is the relationship between education and respondents' salaries conditional on the sex of the worker?
ggplot(wage1, aes(x = educ, y = wage, color = as.factor(female))) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
labs(x = "Years of education",
y = "Hourly wage (USD)",
color = "Female") +
theme(legend.position = "bottom")
Check what happens when we replace color = as.factor(female) with color = female.
- What insights can we gather from this graph?
Multiple linear regression
This question can be formalized mathematically as:
\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + \beta_2Female + ϵ\]
Our interest here would be to build a model that predicts the hourly wage of a respondent (our outcome variable) using the years of education and their sex (our explanatory variables).
Let's remember the syntax for running a regression model in R:
your_model_biv <- lm(outcome_variable ~ explanatory_variable, data = your_dataset) # for a bivariate regression
your_model_mult <- lm(outcome_variable ~ explanatory_variable_1 + explanatory_variable_2, data = your_dataset) # for a multiple regression
Now let's create our own model, save it into the model_2 object, and print the results, based on the regression we specified above, in which wage is our outcome variable, educ and female are our explanatory variables, and our data come from the wage1 object:
model_2 <- lm(wage ~ educ + female, data = wage1)
summary(model_2)
##
## Call:
## lm(formula = wage ~ educ + female, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9890 -1.8702 -0.6651 1.0447 15.4998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.62282 0.67253 0.926 0.355
## educ 0.50645 0.05039 10.051 < 2e-16 ***
## female -2.27336 0.27904 -8.147 2.76e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.186 on 523 degrees of freedom
## Multiple R-squared: 0.2588, Adjusted R-squared: 0.256
## F-statistic: 91.32 on 2 and 523 DF, p-value: < 2.2e-16
Exercise
\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + \beta_2Female + ϵ\]
| | Dependent variable: Hourly wage |
|---|---|
| Years of education | 0.506*** (0.050) |
| Female | -2.273*** (0.279) |
| Constant | 0.623 (0.673) |
| Observations | 526 |
| R2 | 0.259 |
| Adjusted R2 | 0.256 |
| Residual Std. Error | 3.186 (df = 523) |
| F Statistic | 91.315*** (df = 2; 523) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
How would you interpret the results of our model_2?
- What does the constant mean?
- What does the educ coefficient mean?
- What does the female coefficient mean?
Predicting from our models
As we discussed previously, when we do not have our causal inference hats on, the main goal of linear regression is to predict an outcome value on the basis of one or multiple predictor variables.
R has a generic function predict() that helps us arrive at the predicted values on the basis of our explanatory variables.
The syntax of predict() is the following:
predict(name_of_the_model, newdata = data.frame(explanatory1 = value, explanatory2 = value))
Say that, based on our model_2, we are interested in the expected average hourly wage of a woman with 15 years of education.
predict(model_2, newdata = data.frame(educ = 15, female = 1))
## 1
## 5.946237
- What does this result tell us?
- What happens when you change female to 0? What does the result mean?
- Can you think of a way to find the difference in the expected hourly wage between a male with 16 years of education and a female with 17? (One possible sketch follows the output below.)
# the gain from one additional year of education, holding sex constant (here: two males)
predict(model_2, newdata = data.frame(educ = 16, female = 0)) - predict(model_2, newdata = data.frame(educ = 15, female = 0))
## 1
## 0.5064521
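For the last question above, one possible sketch with predict() (the specific values are taken from the question itself):
# expected hourly wage of a male with 16 years of education minus that of a female with 17
predict(model_2, newdata = data.frame(educ = 16, female = 0)) -
  predict(model_2, newdata = data.frame(educ = 17, female = 1))
# given the coefficients above, this should be roughly 2.27 - 0.51, i.e. about 1.77 dollars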
Quiz
Here are some questions for you. Note that there are multiple ways to reach the same answer:
What is the expected hourly wage of a male with 15 years of education?
- $8.22
- $9.50
- $5.34
- $3
How much more on average does a male worker earn than a female counterpart?
- $2.27
- In our data, males on average earn less than females
- $1.20
- $4.50
How much more is a worker expected to earn for every additional year of education, keeping sex constant?
- $0.90
- $1.20
- $0.51
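If you would rather check your answers with code than by hand, a minimal sketch using model_2:
predict(model_2, newdata = data.frame(educ = 15, female = 0)) # expected wage of a male with 15 years of education
coef(model_2)["female"] # average male-female wage gap (note the sign)
coef(model_2)["educ"] # expected gain per additional year of education, keeping sex constant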
DAGs and modeling
As you may remember from our slides, we were introduced to a set of key rules for understanding how to employ DAGs to guide our modeling strategy:
- A path is open or unblocked at non-colliders (confounders or mediators)
- A path is (naturally) blocked at colliders
- An open path induces statistical association between two variables
- Absence of an open path implies statistical independence
- Two variables are d-connected if there is an open path between them
- Two variables are d-separated if the path between them is blocked
In this portion of the tutorial we will demonstrate how different biases come into play when we model our relationships of interest.
What happens when we control for a collider?
The case of beauty, talent, and celebrity
As our DAG showcases, we assume that earning celebrity status is a function of an individual's beauty and talent.
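The DAG itself is not reproduced here, but a minimal ggdag sketch of the assumed structure (beauty and talent both pointing at celebrity, which makes celebrity a collider; the object name celebrity_dag is just illustrative) could look like this:
celebrity_dag <- ggdag::dagify(celebrity ~ beauty + talent) # beauty and talent pointing at celebrity
ggdag::ggdag(celebrity_dag) +
theme_dag()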
We will simulate data that reflect these assumptions. In our simulated world, someone gains celebrity status if the sum of their units of beauty and talent is greater than 8.
# beauty - 1000 observations with mean 5 units of beauty and sd 1.5 (arbitrary scale)
beauty <- rnorm(1000, 5, 1.5)
# talent - 1000 observations with mean 3 units of talent and sd 1 (arbitrary scale)
talent <- rnorm(1000, 3, 1)
# celebrity - binary
celebrity_status <- ifelse(beauty + talent > 8, "Celebrity", "Not Celebrity") # celebrity if the sum of units is greater than 8
celebrity_df <- dplyr::tibble(beauty, talent, celebrity_status) # we make a df with our values
head(celebrity_df, 10)
## # A tibble: 10 x 3
## beauty talent celebrity_status
## <dbl> <dbl> <chr>
## 1 7.06 5.33 Celebrity
## 2 4.15 3.52 Not Celebrity
## 3 5.54 3.97 Celebrity
## 4 5.95 3.38 Celebrity
## 5 5.61 2.00 Not Celebrity
## 6 4.84 2.40 Not Celebrity
## 7 7.27 3.17 Celebrity
## 8 4.86 0.0715 Not Celebrity
## 9 8.03 2.15 Celebrity
## 10 4.91 3.80 Celebrity
In this case, as our simulation suggests, we have a collider structure: celebrity status is a function of beauty and talent. We can also infer from the way we defined the variables that beauty and talent are d-separated (i.e., the path between them is closed because celebrity is a collider).
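We can also ask dagitty directly about d-separation. A small sketch, assuming the three-node structure described above (the DAG string is written here purely for illustration):
collider_dag <- dagitty::dagitty("dag {beauty -> celebrity <- talent}")
dagitty::dseparated(collider_dag, "beauty", "talent", list()) # TRUE: with no conditioning, the path is closed
dagitty::dseparated(collider_dag, "beauty", "talent", "celebrity") # FALSE: conditioning on the collider opens it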
Say you are interested in researching the relationship between beauty and talent for your Master's thesis. While doing your literature review, you encounter a series of papers that find a negative relationship between the two and state that more beautiful people tend to be less talented. The model that these teams of researchers used was the following:
\[Y_{Talent} = \beta_0 + \beta_1Beauty + \beta_2Celebrity\]
Your scientific hunch makes you believe that celebrity is a collider and that, by controlling for it in their models, the researchers are inducing collider bias, or endogenous bias. You decide to move forward with your thesis by laying out a criticism of previous work in the field, since you consider the formalization of their models erroneous. You use the same data the previous papers used but, based on your logic, do not control for celebrity status. This is what you find:
True model
true_model_celebrity <- lm(talent ~ beauty, data = celebrity_df)
summary(true_model_celebrity)
##
## Call:
## lm(formula = talent ~ beauty, data = celebrity_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9225 -0.6588 -0.0083 0.6628 3.5877
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.962209 0.107595 27.531 <2e-16 ***
## beauty 0.006545 0.020755 0.315 0.753
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9865 on 998 degrees of freedom
## Multiple R-squared: 9.964e-05, Adjusted R-squared: -0.0009023
## F-statistic: 0.09945 on 1 and 998 DF, p-value: 0.7526
ggplot(celebrity_df, aes(x=beauty,
y=talent)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(x = "Beauty",
y = "Talent")
Biased model from previous literature
Let's see:
biased_model_celebrity <- lm(talent ~ beauty + celebrity_status, data = celebrity_df)
summary(biased_model_celebrity)
##
## Call:
## lm(formula = talent ~ beauty + celebrity_status, data = celebrity_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4244 -0.5394 0.0110 0.5064 2.9429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.37834 0.13983 38.46 <2e-16 ***
## beauty -0.32668 0.02265 -14.43 <2e-16 ***
## celebrity_statusNot Celebrity -1.51375 0.06808 -22.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.807 on 997 degrees of freedom
## Multiple R-squared: 0.3316, Adjusted R-squared: 0.3302
## F-statistic: 247.3 on 2 and 997 DF, p-value: < 2.2e-16
ggplot(celebrity_df, aes(x=beauty, y=talent, color = celebrity_status)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(x = "Beauty",
y = "Talent",
color = "")
As we can see, by controlling for a collider, the previous literature was inducing a non-existent association between beauty and talent, also known as collider or endogenous bias.
What happens when we fail to control for a confounder?
Shoe size and salary
# sex - replicate male and female 500 times each
sex <- rep(c("Male", "Female"), each = 500)
# shoe size - random number with mean 38 and sd 2, plus 4 if the observation is male
shoesize <- rnorm(1000, 38, 2) + (4 * as.numeric(sex == "Male"))
# salary - a random number with mean 25 and sd 2, plus 5 if the observation is male
salary <- rnorm(1000, 25, 2) + (5 * as.numeric(sex == "Male"))
salary_df <- dplyr::tibble(sex, shoesize, salary)
head(salary_df, 10)
## # A tibble: 10 x 3
## sex shoesize salary
## <chr> <dbl> <dbl>
## 1 Male 42.5 28.6
## 2 Male 41.4 28.4
## 3 Male 38.6 29.2
## 4 Male 38.0 27.7
## 5 Male 39.4 32.2
## 6 Male 42.7 28.2
## 7 Male 41.7 32.6
## 8 Male 40.5 27.1
## 9 Male 40.4 31.7
## 10 Male 43.1 28.8
Say now one of your peers tells you about a new study suggesting that shoe size has an effect on an individual's salary. You are a bit skeptical and read it. The model that these researchers apply is the following:
\[Y_{Salary} = \beta_0 + \beta_1ShoeSize\]
Your scientific hunch makes you believe that this relationship could be confounded by the sex of the respondent. You think that by failing to control for sex in their models, the researchers are inducing omitted variable bias. You decide to open their replication files and control for sex. This is what you find:
\[Y_{Salary} = \beta_0 + \beta_1ShoeSize + \beta_2Sex\]
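A minimal ggdag sketch of this assumed structure, with sex confounding the shoe size-salary relationship (object and node names are illustrative), could look like this:
confounder_dag <- ggdag::dagify(salary ~ shoesize + sex, # shoe size (the effect under study) and sex pointing at salary
                                shoesize ~ sex, # sex pointing at shoe size
                                exposure = "shoesize",
                                outcome = "salary")
ggdag::ggdag(confounder_dag) +
theme_dag()
dagitty::adjustmentSets(confounder_dag) # should return { sex }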
True model
true_model_salary <- lm(salary ~ shoesize + sex, data = salary_df)
summary(true_model_salary)
##
## Call:
## lm(formula = salary ~ shoesize + sex, data = salary_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2341 -1.3698 -0.0501 1.3595 6.4303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.73879 1.15886 22.210 <2e-16 ***
## shoesize -0.02030 0.03044 -0.667 0.505
## sexMale 5.05924 0.17616 28.720 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.981 on 997 degrees of freedom
## Multiple R-squared: 0.6129, Adjusted R-squared: 0.6121
## F-statistic: 789.2 on 2 and 997 DF, p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary, color = sex)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(x = "Shoe size",
y = "Salary",
color = "")
Biased model from previous literature
biased_model_salary <- lm(salary ~ shoesize, data = salary_df)
summary(biased_model_salary)
##
## Call:
## lm(formula = salary ~ shoesize, data = salary_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8777 -1.9101 -0.0511 1.8496 7.9774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.68865 1.17280 3.145 0.00171 **
## shoesize 0.59429 0.02925 20.319 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.676 on 998 degrees of freedom
## Multiple R-squared: 0.2926, Adjusted R-squared: 0.2919
## F-statistic: 412.9 on 1 and 998 DF, p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal() +
labs(x = "Shoe size",
y = "Salary")
As we can see, by failing to control for a confounder, the previous literature was creating a non-existent association between shoe size and salary, incurring omitted variable bias.