Regression Discontinuity Designs

Welcome

Introduction!

Welcome to our seventh tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.

During this week's lecture you were introduced to Regression Discontinuity Designs (RDDs).

In this lab session we will:

  • Leverage visualizations with ggplot2 to explore our discontinuity setups
  • Learn how to model our discontinuity setups under different functional forms with lm()
  • Learn how to model our discontinuity setups under different functional forms with rdrobust::rdrobust()

Packages

# These are the libraries we will use today. Make sure to install them in your console in case you have not done so previously.

library(dplyr) # for data wrangling
library(ggplot2) # for creating plots
library(rdrobust) # for rdrobust()
library(readr) # for loading the .csv data

set.seed(42) # for consistent results

mlda_df <- readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/stats-ii-lab/master/Session%206/data/mlda.csv") # loading data from Mastering Metrics

Example 2. Measuring the long term effects of a conditional cash transfer program on educational achievement

Imagine that you work as a technical adviser for the Ministry of Education in your country. You are tasked to assess whether a Conditional Cash Transfer (CCT) program established decades before yields positive results on the beneficiaries' educational attainment. There is a large amount of evidence which suggests that CCTs encourage households to increase the use of educational services.

You read the guidelines for the program. Families receive a stipend per child provided they keep their them in school and take them for health checks. Additionally, you note that under the rules of the program, beneficiaries are selected based on a household income threshold of €20000. You decide to dive into the data with the idea that a discontinuity is created based on the income threshold. (This example utilizes simulated data)

cct_df <- readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/stats-ii-lab/master/Session%206/data/cct_data.csv") # loading simulated data frame of the program

The dataset consists of:

  • hh_income: household income in euros
  • years_of_schooling: years of schooling for respondent
  • treatment: binary variable indicating whether respondent was a beneficiary of the program

Checking visually whether a sharp-RDD makes sense for the analysis

What we are looking for in this case is whether our €20000 threshold is in fact the cut-off for treatment. That is to say, that only those who had a household income of equal or less than €20000 received the cash transfer.

ggplot(cct_df, 
       aes(x = hh_income, 
           y = years_of_schooling, 
           color = factor(treatment))) + 
  geom_point() + 
  labs(x = "Household Income", 
       y = "Years of Schooling") +
  scale_color_discrete(name = " ", 
                       labels = c("No treatment", "Treatment")) +
  geom_vline(xintercept = 20000, linetype = "dotted") +
  theme_minimal()

ggplot(cct_df, 
       aes(x = hh_income, 
           y = treatment, 
           color = factor(treatment))) + 
  geom_point() + 
  labs(x = "Household Income", 
       y = "Treatment") +
  scale_color_discrete(name = " ", 
                       labels = c("No treatment", "Treatment")) +
  geom_vline(xintercept = 20000, linetype = "dotted") +
  theme_minimal()

We can see from the graph that our €20000 threshold is in fact cutting off the distribution of the treatment. This would make household income a viable forcing variable for a sharp-RDD set-up.


Estimating our model

We can see that the relationship is fairly linear, so we decide to run a linear model with common slope.

# running linear model with common slope
ed_achievement <- lm(years_of_schooling ~ treatment + hh_income, data = cct_df)
stargazer::stargazer(ed_achievement, type = "text")
## 
## ================================================
##                         Dependent variable:     
##                     ----------------------------
##                          years_of_schooling     
## ------------------------------------------------
## treatment                     2.460***          
##                               (0.038)           
##                                                 
## hh_income                     0.001***          
##                              (0.00000)          
##                                                 
## Constant                     -2.648***          
##                               (0.111)           
##                                                 
## ------------------------------------------------
## Observations                   5,000            
## R2                             0.815            
## Adjusted R2                    0.815            
## Residual Std. Error      0.817 (df = 4997)      
## F Statistic         11,008.950*** (df = 2; 4997)
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

WHAT DO THESE RESULTS TELL US?

In line with our assumptions for linear models with common slope, we consider that treatment effect, \(D_i\), does not depend on the forcing \(X_i\). Hence, we can expect that students who received the treatment get on average 2.4 more years of schooling. We also see that for every €1,000 increase in the household income, students are expected to attain 0.6274 more years of education. (Our \(\beta = -6.274e-04*1000\)).


Getting familiar with LOESS

Locally weighted smoothing is a popular tool used in regression analysis that creates a smooth line through a scatter plot to help you to see relationship between variables and foresee trends. We can introduce it to our ggplot() as a part of geom_smooth by calling method "loess".

ggplot(cct_df, 
       aes(x = hh_income, 
           y = years_of_schooling, 
           color = factor(treatment))) + 
  geom_point(alpha = 0.1) + 
  labs(x = "Household Income", 
       y = "Years of Schooling") +
  geom_smooth(method = "loess") + # instead of lm, we use loess. See the difference? try with lm
  scale_color_discrete(name = " ", 
                       labels = c("No treatment", "Treatment")) +
  geom_vline(xintercept = 20000, linetype = "dotted") +
  theme_minimal()

The LOESS smoothing is not very visible in this relationship because of the way we defined the simulated data. Let's look at how it would look in our drinking age example:

ggplot(mlda_df,
       aes(x = agecell,  
           y = outcome, 
           col = factor(treatment))) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "LOESS smoothing",
       x = "Forcing variable (Age)",
       y = "Mortality rate from motor vehicle \naccidents (per 100,000)") +
  scale_color_manual(name = "",
                     values = c("#F8766D", "#00BFC4"),
                     labels = c("Control", "Treatment")) +
  theme_minimal()


Violations to the assumptions

You are made aware by a tax expert from your unit that €20000 is the upper-boundary for a very well known tax concession. You are afraid that people may be sorting themselves before the household income cut-off to become beneficiaries of multiple programs. You decide to check your data.

ggplot(cct_df, 
       aes(x = hh_income)) +
  geom_histogram(fill = "#cc0055") +
  labs(title = "Income distribution",
       x = "Household Income",
       y = "Number of respondents") +
  geom_vline(xintercept = 20000, linetype = "dotted") +
  theme_minimal()


This case looks a bit ambiguous. Do you think people are sorting out just before the cut-off? If sorting were to exist which assumptions would be challenged? Would the existence of other programs that have the same threshold affect a causal reading of our results?

We can also apply an exact binomial test. Our interest here is to see whether the distribution around the treshold could exist by chance. In this case, we can check ±1000 around the threshold.

To gather only the units that reported household incomes from €19000 to €21000, we will use a new function dplyr::between().

dplyr::between() is a shortcut for x >= left & x <= right. Let's look at it at work.

cct_df %>% 
  dplyr::filter(dplyr::between(hh_income, 19000, 21000)) %>% #filter values between 19000 and 21000
  dplyr::group_by(treatment) %>%
  dplyr::summarize(n = n()) %>%
  knitr::kable() %>%
  kableExtra::kable_styling()
treatment n
0 520
1 530

We have 530 units just above the threshold and 520 units just below. We can use this information to run our exact binomial test. We seek to understand if the observed distributions deviate from expected distribution of observations into the two categories.

binom.test(one_of_the_n, total_n, p = probability_of_success)

Let's see:

binom.test(520, 1050, p = 0.5) 
## 
##  Exact binomial test
## 
## data:  520 and 1050
## number of successes = 520, number of trials = 1050, p-value = 0.7812
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.4645699 0.5259331
## sample estimates:
## probability of success 
##              0.4952381

WHAT DO THESE RESULTS TELL US?

According to the test, we see that the observed distributions do not deviate from expected distribution of observations into the two categories when we expect that units just around the threshold end up on either group by chance (coin flip logic, i.e., p = 0.5). In other words, we do not find evidence of sorting.

Previous