The Backdoor Criterion and Basics of Regression in R

Welcome

Introduction!

Welcome to our fourth tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.

During this week's lecture you reviewed bivariate and multiple linear regressions. You also learned how Directed Acyclic Graphs (DAGs) can be leveraged to gather causal estimates.

In this lab session we will:

  • Use the ggdag and dagitty packages to assess your modeling strategy
  • Review how to run regression models using R
  • Illustrate omitted variable and collider bias

Packages

# These are the libraries we will use today. Make sure to install them in your console in case you have not done so previously.
set.seed(42) #for consistent results in randomization
library(wooldridge) # To get our example's dataset 
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(ggplot2) # To visualize data (this package is also loaded by library(tidyverse))
library(ggdag) # To dagify and plot our DAG objects in R
library(dagitty) # To perform analysis in our DAG objects in R
library(stargazer) # To render nicer regression output
data("wage1") # calls the wage1 dataset from the woorldridge package

Working with DAGs in R

Last week we learned about the general syntax of the ggdag package:

  • We created dagified objects with ggdag::dagify()
  • We plotted our DAGs with ggdag::ggdag()
  • We discussed how to specify the coordinates of our nodes with a coordinate list

Today, we will learn how the ggdag and dagitty packages can help us illustrate our paths and adjustment sets to fulfill the backdoor criterion.

Let's take one of the DAGs from our review slides:

coord_dag <- list(
  x = c(d = 0, p = 0, b = 1, a = 1 , c = 2, y = 2),
  y = c(d = 0, p = 2, b = 1, a = -1, c = 2, y = 0)
)

our_dag <- ggdag::dagify(d ~ p + a, # p and a pointing at d
                         b ~ p + c, # p and c pointing at b
                         y ~ d + a + c, # d, a, and c pointing at y
                         coords = coord_dag, # our coordinates from the list up there
                         exposure = "d", # we declare out exposure variable
                         outcome = "y") # we declare out outcome variable

ggdag::ggdag(our_dag) + 
  theme_dag() # equivalent to theme_void()


Learning about our paths and what adjustments we need

As you have seen, when we dagify a DAG in R a dagitty object is created. These objects tell R that we are dealing with DAGs.

This is very important because in addition to plotting them, we can do analyses on the DAG objects. A package that complements ggdag is the dagitty package.
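
For instance, a minimal check of this is to ask R for the class of the object we created above:

class(our_dag) # should return "dagitty", which is why we can feed the object to dagitty functions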

Today, we will focus on two functions from the dagitty package:

  • dagitty::paths(): Returns a list with two components: paths, which gives the actual paths, and open, which shows whether each path is open (d-connected) or closed (d-separated).
  • dagitty::adjustmentSets(): Lists the sets of covariates that would allow for unbiased estimation of causal effects, assuming that the causal graph is correct.

We just need to input our DAG object.


Paths

Let's see what the output of the dagitty::paths() function looks like:

dagitty::paths(our_dag)
## $paths
## [1] "d -> y"                "d <- a -> y"           "d <- p -> b <- c -> y"
## 
## $open
## [1]  TRUE  TRUE FALSE

We see under $paths the three paths we identified during the manual exercise:

  • d -> y
  • d <- a -> y
  • d <- p -> b <- c -> y

Additionally, $open tells us whether each path is open. In this case, we see that the second path is the only open non-causal path, so we would need to condition on a to close it.
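
If you want to double-check this with code, dagitty::paths() also accepts a Z argument with the variables you condition on. A minimal sketch (the exact output layout may differ slightly):

dagitty::paths(our_dag, Z = "a") # with a in the conditioning set, the backdoor d <- a -> y should now be reported as closed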


We can also use ggdag to present the open paths visually with the ggdag_paths() function, as such:

ggdag::ggdag_paths(our_dag) +
  theme_dag()


Covariate adjustment

In addition to listing all the paths and sorting the backdoors manually, we can use the dagitty::adjustmentSets() function.

With this function, we just need to input our DAG object and it will return the different sets of adjustments.

dagitty::adjustmentSets(our_dag)
## { a }

In this DAG, there is only one option: we need to control for a.


We can also use ggdag to present the adjustment sets visually with the ggdag_adjustment_set() function, as such:

Also, do not forget to set the argument shadow = TRUE, so that the arrows from the adjusted nodes are included.

ggdag::ggdag_adjustment_set(our_dag, shadow = TRUE) +
  theme_dag()


If you want to learn more about DAGs in R, the ggdag package vignettes and the dagitty documentation are good places to start.

NOW LET'S MOVE TO REGRESSION


Introduction to Regression

Linear regression is largely used to predict the value of an outcome variable based on one or more input explanatory variables. As we previously discussed, regression addresses a simple mechanical problem, namely, what is our best guess of y given an observed x.

  • Regression can be utilized as a predictive or summarizing tool without thinking about causes
  • It would not be appropriate to give causal interpretations to any \(\beta\), unless we establish the fulfillment of certain assumptions

Bivariate regression

In bivariate regression, we are modeling a variable \(y\) as a mathematical function of one variable \(x\). We can generalize this in a mathematical equation as such:

\[y = \beta_{0} + \beta_{1}x + ϵ\]


Multiple linear regression

In multiple linear regression, we are modeling a variable \(y\) as a mathematical function of multiple variables \((x, z, m)\). We can generalize this in a mathematical equation as such:

\[y = \beta_{0} + \beta_{1}x + \beta_{2}z + \beta_{3}m + ϵ\]


Exploratory questions

Let's illustrate this with an example

We will use the wage1 dataset from the wooldridge package. These are data from the 1976 Current Population Survey, used by Jeffrey M. Wooldridge for pedagogical purposes in his book on Introductory Econometrics.

If you want to check the contents of the wage1 data frame, you can type ?wage1 in your console.
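
For a quick first look at the variables we will use below, something along these lines works (this is just a sketch; head(), str(), or dplyr::glimpse() would all do the job):

head(wage1[, c("wage", "educ", "female")]) # hourly wage in USD, years of education, and a female indicator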


Visualizing

With regression we can answer EXPLORATORY QUESTIONS. For example:

What is the relationship between education and respondents' salaries?

We can start by exploring the relationship visually with our newly attained ggplot2 skills:

ggplot(wage1, aes(x = educ, y = wage)) +
  geom_point(color = "grey60") +
  geom_smooth(method = "lm", se = F, color = "#CC0055") +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Hourly wage (USD)") 


The lm() function

This question can be formalized mathematically as:

\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + ϵ\]

Our interest here would be to build a model that predicts the hourly wage of a respondent (our outcome variable) using the years of education (our explanatory variable). Fortunately for us, R provides us with a very intuitive syntax to model regressions.

The general syntax for running a regression model in R is the following:

your_model_biv <- lm(outcome_variable ~ explanatory_variable, data = your_dataset) # for a bivariate regression
your_model_mult <- lm(outcome_variable ~ explanatory_variable_1 + explanatory_variable_2, data = your_dataset) # for a multiple regression

Now let's create our own model and save it into the model_1 object, based on the bivariate regression we specified above in which wage is our outcome variable, educ is our explanatory variable, and our data come from the wage1 object:

model_1 <- lm(wage ~ educ, data = wage1)

summary() and broom::tidy()

We have created an object that contains the coefficients, standard errors, and further information from the model. In order to see the estimates, you can use the base R function summary(). This function is very useful when you want to print your results in your console.

Alternatively, you can use the tidy() function from the broom package. The function constructs a data frame that summarizes the model's statistical findings. You can see what else you can do with broom by running vignette("broom"). The broom::tidy() function is useful when you want to store the values for future use (e.g., visualizing them, as sketched below).

Let's try both options. You just need to run this code after the model_1 code.

summary(model_1)
broom::tidy(model_1)
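
As a small illustration of why storing the tidy output can be handy, here is a minimal sketch (the object name model_1_tidy is our own choice) that turns the estimates into a coefficient plot:

model_1_tidy <- broom::tidy(model_1, conf.int = TRUE) # estimates plus 95% confidence intervals as a data frame

ggplot(model_1_tidy, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() + # point estimate with its confidence interval
  coord_flip() + # flip the axes so the term names are readable
  theme_minimal() +
  labs(x = "", y = "Estimate (95% CI)")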

Exercise

\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + ϵ\]
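
The regression table below looks like stargazer output; a call along these lines (the labels are our guess at the options used) would render something similar in your console:

stargazer::stargazer(model_1, type = "text",
                     dep.var.labels = "Hourly wage",
                     covariate.labels = "Years of education")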

=====================================================
                          Dependent variable:
                     --------------------------------
                              Hourly wage
-----------------------------------------------------
Years of education              0.541***
                                (0.053)
Constant                        -0.905
                                (0.685)
-----------------------------------------------------
Observations                      526
R2                                0.165
Adjusted R2                       0.163
Residual Std. Error          3.378 (df = 524)
F Statistic               103.363*** (df = 1; 524)
=====================================================
Note:                  *p<0.1; **p<0.05; ***p<0.01

How would you interpret the results of our model_1?

  • What does the constant mean?
  • What does the educ coefficient mean?
  • Do these coefficients carry any causal meaning?

Adding more nuance to our models

As we have discussed in previous sessions, we live in a very complex world. It is very likely that our exploration of the relationship between education and respondents' salaries is open to multiple sources of bias.

Looking back at the US in 1976, can you think of other variables that might be in the mix?

How about the sex or the ethnicity of a worker?


Let's explore this visually

What is the relationship between education and respondents' salaries conditional on the sex of the worker?

ggplot(wage1, aes(x = educ, y = wage, color = as.factor(female))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Hourly wage (USD)",
       color = "Female") +
  theme(legend.position = "bottom")

Check what happens when we replace color = as.factor(female) with color = female.

  • What insights can we gather from this graph?

Multiple linear regression

This question can be formalized mathematically as:

\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + \beta_2Female + ϵ\]

Our interest here would be to build a model that predicts the hourly wage of a respondent (our outcome variable) using the years of education and their sex (our explanatory variables).

Let's remember the syntax for running a regression model in R:

your_model_biv <- lm(outcome_variable ~ explanatory_variable, data = your_dataset) # for a bivariate regression
your_model_mult <- lm(outcome_variable ~ explanatory_variable_1 + explanatory_variable_2, data = your_dataset) # for a multiple regression

Now let's create our own model, save it into the model_2 object, and print the results based on the regression formula we specified above, in which wage is our outcome variable, educ and female are our explanatory variables, and our data come from the wage1 object:

model_2 <- lm(wage ~ educ + female, data = wage1)
summary(model_2)
## 
## Call:
## lm(formula = wage ~ educ + female, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9890 -1.8702 -0.6651  1.0447 15.4998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.62282    0.67253   0.926    0.355    
## educ         0.50645    0.05039  10.051  < 2e-16 ***
## female      -2.27336    0.27904  -8.147 2.76e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.186 on 523 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.256 
## F-statistic: 91.32 on 2 and 523 DF,  p-value: < 2.2e-16

Exercise

\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + \beta_2Female + ϵ\]

=====================================================
                          Dependent variable:
                     --------------------------------
                              Hourly wage
-----------------------------------------------------
Years of education              0.506***
                                (0.050)
Female                         -2.273***
                                (0.279)
Constant                        0.623
                                (0.673)
-----------------------------------------------------
Observations                      526
R2                                0.259
Adjusted R2                       0.256
Residual Std. Error          3.186 (df = 523)
F Statistic                91.315*** (df = 2; 523)
=====================================================
Note:                  *p<0.1; **p<0.05; ***p<0.01

How would you interpret the results of our model_2?

  • What does the constant mean?
  • What does the educ coefficient mean?
  • What does the female coefficient mean?

Predicting from our models

As we discussed previously, when we do not have our causal inference hats on, the main goal of linear regression is to predict an outcome value on the basis of one or multiple predictor variables.

R has a generic function predict() that helps us arrive at the predicted values on the basis of our explanatory variables.

The syntax of predict() is the following:

predict(name_of_the_model, newdata = data.frame(explanatory1 = value, explanatory2 = value))

Say that based on our model_2, we are interested in the expected average hourly wage of a woman with 15 years of education.

predict(model_2, newdata = data.frame(educ = 15, female = 1))
##        1 
## 5.946237
  • What does this result tell us?
  • What happens when you change female to 0? What does the result mean?
  • Can you think of a way to find the difference in the expected hourly wage between a male with 16 years of education and a male with 15?
predict(model_2, newdata = data.frame(educ = 16, female = 0)) - predict(model_2, newdata = data.frame(educ = 15, female = 0))
##         1 
## 0.5064521

Quiz

Here are some questions for you. Note that there are multiple ways to reach the same answer (one possible way to check your answers in R is sketched after the questions):

What is the expected hourly wage of a male with 15 years of education?

  1. $8.22
  2. $9.50
  3. $5.34
  4. $3

How much more on average does a male worker earn than a female counterpart?

  1. $2.27
  2. In our data, males on average earn less than females
  3. $1.20
  4. $4.50

How much more is a worker expected to earn for every additional year of education, keeping sex constant?

  1. $0.90
  2. $1.20
  3. $0.5
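
If you want to verify your answers in R, one possible route (a sketch; there are other ways) is:

# expected hourly wage of a male with 15 years of education
predict(model_2, newdata = data.frame(educ = 15, female = 0))

# average male-female gap, holding education constant (the female coefficient with the sign flipped)
-coef(model_2)["female"]

# expected change in hourly wage per additional year of education, keeping sex constant
coef(model_2)["educ"]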

DAGs and modeling

As we can remember from our slides, we were introduced to a set of key rules for understanding how to employ DAGs to guide our modeling strategy:

  • A path is open or unblocked at non-colliders (confounders or mediators)
  • A path is (naturally) blocked at colliders
  • An open path induces statistical association between two variables
  • Absence of an open path implies statistical independence
  • Two variables are d-connected if there is an open path between them
  • Two variables are d-separated if all paths between them are blocked (see the quick check below)
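
We can check the last two rules directly on our_dag from the first part of the tutorial. The sketch below assumes the dagitty::dconnected() and dagitty::dseparated() helpers, which test exactly these properties:

# p and c are d-separated: every path between them is blocked at a collider (b or y)
dagitty::dseparated(our_dag, "p", "c")

# conditioning on the collider b opens the path p -> b <- c, so p and c become d-connected
dagitty::dconnected(our_dag, "p", "c", "b")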

In this portion of the tutorial, we will demonstrate how different biases come into play when we model our relationships of interest.


What happens when we control for a collider?

The case of beauty, talent, and celebrity

As our DAG showcases, we assume that earning celebrity status is a function of an individual's beauty and talent.
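
If you want to reproduce that DAG yourself, here is a minimal sketch (the node names and coordinates are our own choice):

celebrity_dag <- ggdag::dagify(celebrity ~ beauty + talent, # beauty and talent both point at celebrity
                               coords = list(x = c(beauty = 0, talent = 2, celebrity = 1),
                                             y = c(beauty = 0, talent = 0, celebrity = 1)))

ggdag::ggdag(celebrity_dag) +
  theme_dag()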

We will simulate data that reflect these assumptions. In our world, someone gains celebrity status if the sum of their units of beauty and talent is greater than 8.

# beauty - 1000 observations with mean 5 units of beauty and sd 1.5 (arbitrary scale)
beauty <- rnorm(1000, 5, 1.5)

# talent - 1000 observations with mean 3 units of talent and sd 1 (arbitrary scale)
talent <- rnorm(1000, 3, 1)

# celebrity - binary
celebrity_status <-  ifelse(beauty + talent > 8, "Celebrity" , "Not Celebrity") # celebrity if the sum of units  are greater than 8

celebrity_df <- dplyr::tibble(beauty, talent, celebrity_status) # we make a df with our values

head(celebrity_df, 10)
## # A tibble: 10 x 3
##    beauty talent celebrity_status
##     <dbl>  <dbl> <chr>           
##  1   7.06 5.33   Celebrity       
##  2   4.15 3.52   Not Celebrity   
##  3   5.54 3.97   Celebrity       
##  4   5.95 3.38   Celebrity       
##  5   5.61 2.00   Not Celebrity   
##  6   4.84 2.40   Not Celebrity   
##  7   7.27 3.17   Celebrity       
##  8   4.86 0.0715 Not Celebrity   
##  9   8.03 2.15   Celebrity       
## 10   4.91 3.80   Celebrity

In this case, as our simulation suggests, we have a collider structure: celebrity status is a function of beauty and talent. We can also infer from the way we defined the variables that beauty and talent are d-separated (i.e., the path between them is closed because celebrity is a collider).

Say you are interested in researching the relationship between beauty and talent for your Master's thesis. While doing your literature review, you encounter a series of papers that find a negative relationship between the two and state that more beautiful people tend to be less talented. The model that these teams of researchers used was the following:

\[Y_{Talent} = \beta_0 + \beta_1Beauty + \beta_2Celebrity\]

Your scientific hunch makes you believe that celebrity is a collider and that by controlling for it in their models, the researchers are inducing collider bias, or endogenous bias. You decide to move forward with your thesis by laying out a criticism of previous work in the field, given that you consider the formalization of their models to be erroneous. You utilize the same data previous papers used but, based on your logic, you do not control for celebrity status. This is what you find:

True model

true_model_celebrity <- lm(talent ~ beauty, data = celebrity_df)
summary(true_model_celebrity)
## 
## Call:
## lm(formula = talent ~ beauty, data = celebrity_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9225 -0.6588 -0.0083  0.6628  3.5877 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.962209   0.107595  27.531   <2e-16 ***
## beauty      0.006545   0.020755   0.315    0.753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9865 on 998 degrees of freedom
## Multiple R-squared:  9.964e-05,  Adjusted R-squared:  -0.0009023 
## F-statistic: 0.09945 on 1 and 998 DF,  p-value: 0.7526
ggplot(celebrity_df, aes(x=beauty, 
                         y=talent)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Beauty",
       y = "Talent")

Biased model from previous literature

Let's see:

biased_model_celebrity <- lm(talent ~ beauty + celebrity_status, data = celebrity_df)
summary(biased_model_celebrity)
## 
## Call:
## lm(formula = talent ~ beauty + celebrity_status, data = celebrity_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4244 -0.5394  0.0110  0.5064  2.9429 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    5.37834    0.13983   38.46   <2e-16 ***
## beauty                        -0.32668    0.02265  -14.43   <2e-16 ***
## celebrity_statusNot Celebrity -1.51375    0.06808  -22.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.807 on 997 degrees of freedom
## Multiple R-squared:  0.3316, Adjusted R-squared:  0.3302 
## F-statistic: 247.3 on 2 and 997 DF,  p-value: < 2.2e-16
ggplot(celebrity_df, aes(x=beauty, y=talent, color = celebrity_status)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Beauty",
       y = "Talent",
       color = "")

As we can see, by controlling for a collider, the previous literature was inducing a non-existent association between beauty and talent, which is known as collider or endogenous bias.


What happens when we fail to control for a confounder?

Shoe size and salary

# sex - replicate male and female 500 times each
sex <- rep(c("Male", "Female"), each = 500) 

# shoe size - random number with mean 38 and sd 2, plus 4 if the observation is male
shoesize <- rnorm(1000, 38, 2) +  (4 * as.numeric(sex == "Male"))

# salary - a random number with mean 25 and sd 2, plus 5 if the observation is male
salary <- rnorm(1000, 25, 2) + (5 * as.numeric(sex == "Male"))

salary_df <- dplyr::tibble(sex, shoesize, salary)

head(salary_df, 10)
## # A tibble: 10 x 3
##    sex   shoesize salary
##    <chr>    <dbl>  <dbl>
##  1 Male      42.5   28.6
##  2 Male      41.4   28.4
##  3 Male      38.6   29.2
##  4 Male      38.0   27.7
##  5 Male      39.4   32.2
##  6 Male      42.7   28.2
##  7 Male      41.7   32.6
##  8 Male      40.5   27.1
##  9 Male      40.4   31.7
## 10 Male      43.1   28.8

Say one of your peers now tells you about a new study that suggests that shoe size has an effect on an individual's salary. You are a bit skeptical and read it. The model that these researchers apply is the following:

\[Y_{Salary} = \beta_0 + \beta_1ShoeSize\]

Your scientific hunch makes you believe that this relationship could be confounded by the sex of the respondent. You think that by failing to control for sex in their models, the researchers are inducing omitted variable bias. You decide to open their replication files and control for sex. This is what you find:

\[Y_{Salary} = \beta_0 + \beta_1ShoeSize + \beta_2Sex\]
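
Before running the models, we can also encode this story as a DAG and let dagitty confirm the adjustment. This is only a sketch (the node names are our own), keeping the hypothesized shoesize → salary arrow as the effect of interest:

salary_dag <- ggdag::dagify(shoesize ~ sex, # sex affects shoe size
                            salary ~ shoesize + sex, # sex also affects salary, confounding the relationship
                            exposure = "shoesize",
                            outcome = "salary")

dagitty::adjustmentSets(salary_dag) # should return { sex }: we need to control for sex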


True model

true_model_salary <- lm(salary ~ shoesize + sex, data = salary_df)
summary(true_model_salary)
## 
## Call:
## lm(formula = salary ~ shoesize + sex, data = salary_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2341 -1.3698 -0.0501  1.3595  6.4303 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.73879    1.15886  22.210   <2e-16 ***
## shoesize    -0.02030    0.03044  -0.667    0.505    
## sexMale      5.05924    0.17616  28.720   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.981 on 997 degrees of freedom
## Multiple R-squared:  0.6129, Adjusted R-squared:  0.6121 
## F-statistic: 789.2 on 2 and 997 DF,  p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Shoe size",
       y = "Salary",
       color = "")


Biased model from previous literature

biased_model_salary <- lm(salary ~ shoesize, data = salary_df)
summary(biased_model_salary)
## 
## Call:
## lm(formula = salary ~ shoesize, data = salary_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8777 -1.9101 -0.0511  1.8496  7.9774 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.68865    1.17280   3.145  0.00171 ** 
## shoesize     0.59429    0.02925  20.319  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.676 on 998 degrees of freedom
## Multiple R-squared:  0.2926, Adjusted R-squared:  0.2919 
## F-statistic: 412.9 on 1 and 998 DF,  p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(x = "Shoe size",
       y = "Salary")

As we can see, by failing to control for a confounder, the previous literature was creating a non-existent association between shoe size and salary, incurring omitted variable bias.
