Matching in R

Welcome

Introduction!

Welcome to our fifth tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.

During this week's lecture you reviewed randomization in experimental setups. You also learned how matching can be leveraged to gather causal estimates.

In this lab session we will:

  • Take a step back to review how to compare the means of two groups in R
  • Learn how to perform matching with the MatchIt package
  • Illustrate the mechanics of propensity score matching with gml()

Packages

# These are the libraries we will use today. Make sure to install them in your console in case you have not done so previously.

library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(ggplot2) # To create plots (this package is also loaded by library(tidyverse))
library(purrr) # To repeat code across our list in the balance table purrr::map() (this package is also loaded by library(tidyverse))
library(broom) # To format regression output
library(stargazer) # To format model output
library(knitr) # To create HTML tables with kable()
library(kableExtra) # To format the HTML tables
library(MatchIt) # To match

Our data

Today we will work with data from the Early Childhood Longitudinal Study (ECLS).

ecls_df <- 
  readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/stats-ii-lab/master/Session%204/data/ecls.csv") %>% 
  dplyr::select(-childid, -race, -w3daded,
                -w3momed, -w3inccat) #drop these columns (-)

names(ecls_df) #checking the names of the variables in the data frame
##  [1] "catholic"      "race_white"    "race_black"    "race_hispanic"
##  [5] "race_asian"    "p5numpla"      "p5hmage"       "p5hdage"      
##  [9] "w3daded_hsb"   "w3momed_hsb"   "w3momscr"      "w3dadscr"     
## [13] "w3income"      "w3povrty"      "p5fstamp"      "c5r2mtsc"     
## [17] "c5r2mtsc_std"

(Example inspired by Simon Ejdemyr: https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html)


Comparing Differences in Means and Balance Tables

As you may have seen in this week's application paper, balance tables feature prominently in work that utilizes matching.

In binary treatment setups, balance tables present an overview of the difference in means for the groups accross covariates.

Let's take the dataset as an example to review how to compare differences in means and build balance tables in R. There are multiple ways to explore these kinds of questions. In this lab we will leverage t-tests to check the statistical significance of the difference in means.

In other words, we want to know whether the observed differences in average value of a variable between the two groups or samples can be due to random chance (our null hypothesis).


T-tests in R

In this case, let's imagine we want to check the statistical significance of the differences in the composition of the catholic and public school samples in the w3povrty (under the poverty line) variable.

The general syntax for a t-test is simply as follows. The vectors refer to the variables whose mean you want to compare.

t.test(y ~ x, data = your_data)

Exercise

t.test(w3povrty ~ catholic, data = ecls_df)
## 
##  Welch Two Sample t-test
## 
## data:  w3povrty by catholic
## t = 26.495, df = 4034.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1678107 0.1946307
## sample estimates:
## mean in group 0 mean in group 1 
##      0.21918633      0.03796562
  • What does the group 0 mean tell us?
  • What does the group 1 mean tell us?
  • Is the difference between catholic school and public school students statistically significant?

Balance tables in R

Now let's take a look at how can we create a simple and good looking balance table. The idea here is to compare the mean values of different variables across populations or groups. In our case, let's see whether the catholic and public school groups differ across the covariates in the dataset:

Here is one way to create the table (you can adapt this code for the assignment).

# create a list with the covariates
list_cov <- c("race_white", "race_black", "race_hispanic", "race_asian", "p5numpla",
              "p5hmage", "p5hdage", "w3daded_hsb", "w3momed_hsb", "w3momscr", "w3dadscr",
              "w3income", "w3povrty", "p5fstamp", "c5r2mtsc", "c5r2mtsc_std") 


ecls_df %>% # our data frame
  dplyr::summarize_at(list_cov, funs(list(broom::tidy(t.test(. ~ catholic))))) %>% # sequentially run t-tests across all the covariates in the list_cov (note that you have to change the "treatment")
  purrr::map(1) %>% # maps into a list
  dplyr::bind_rows(.id='variables') %>% # binds list into a single data frame and names the id column "variables" 
  dplyr::select(variables, estimate1, estimate2, p.value) %>% # select only the names, group means, and p-values
  dplyr::mutate_if(is.numeric, round, 3) %>% # round numeric variables to three places
  knitr::kable(col.names = c("Variable", "(Catholic = 0)", "(Catholic = 1)", "P value")) %>% # create kable table and remane headings
  kableExtra::kable_styling() # style kable table for our knitted document
Variable (Catholic = 0) (Catholic = 1) P value
race_white 0.556 0.725 0.000
race_black 0.136 0.054 0.000
race_hispanic 0.181 0.131 0.000
race_asian 0.066 0.052 0.025
p5numpla 1.133 1.093 0.000
p5hmage 37.561 39.575 0.000
p5hdage 40.392 41.988 0.000
w3daded_hsb 0.488 0.259 0.000
w3momed_hsb 0.464 0.227 0.000
w3momscr 43.114 47.492 0.000
w3dadscr 42.713 46.356 0.000
w3income 54889.159 82074.301 0.000
w3povrty 0.219 0.038 0.000
p5fstamp 0.129 0.015 0.000
c5r2mtsc 50.209 52.389 0.000
c5r2mtsc_std -0.031 0.194 0.000

Exercise

  • Are the differences in means significant at conventional levels?
  • What differences strike you from the composition of the two samples?

The Effect of Catholic School on Student Achievement

In this tutorial we’ll analyze the effect of going to Catholic school, as opposed to public school, on student achievement. Because students who attend Catholic school on average are different from students who attend public school, we will use matching to get more credible causal estimates of Catholic schooling.


Exploration of the data

ecls_df %>%
  dplyr::group_by(catholic) %>%
  dplyr::summarize(n_students = n(),
            avg_math = mean(c5r2mtsc_std),
            std_error = sd(c5r2mtsc_std) / sqrt(n_students)) %>% 
  round(3) %>% # round the results
  knitr::kable() %>% # create kable table
  kableExtra::kable_styling() # view kable table
catholic n_students avg_math std_error
0 9568 -0.031 0.010
1 1510 0.194 0.022

We can see that we have many more students that did not attend Catholic school than those who did. Also, the Catholic school students have a higher average math score.


Naive Average Treatment Effect (NATE)

We can naively compare the students on their standardized math scores (c5r2mtsc_std). As you remember, the Naive Average Treatment Effect (NATE) is the difference in the means of the observed outcomes of the two groups:

\[NATE = E(y^1 | D = 1) - E(y^0 | D = 0)\]

In this case, the NATE would be difference between the average math scores.


NATE manually

Do you remember what we did in the last section?

We could substract the avg_math for the catholic = 1 and the avg_math for the catholic = 0

\[NATE = 0.194 - (-0.031)\] \[NATE = 0.194 + 0.031 = 0.225\]


NATE with t.test()

t.test(c5r2mtsc_std ~ catholic, data = ecls_df)
## 
##  Welch Two Sample t-test
## 
## data:  c5r2mtsc_std by catholic
## t = -9.1069, df = 2214.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2727988 -0.1761292
## sample estimates:
## mean in group 0 mean in group 1 
##     -0.03059583      0.19386817

NATE with lm()

lm(c5r2mtsc_std ~ catholic, data = ecls_df) %>% 
  stargazer::stargazer(.,type = "html")
Dependent variable:
c5r2mtsc_std
catholic 0.224***
(0.028)
Constant -0.031***
(0.010)
Observations 11,078
R2 0.006
Adjusted R2 0.006
Residual Std. Error 0.997 (df = 11076)
F Statistic 66.096*** (df = 1; 11076)
Note: p<0.1; p<0.05; p<0.01

Exercise

  • What parallels do you find between substracting the manually extracted means, the t-test, and the linear regression?

Matching with MatchIt

MatchIt is designed for causal inference with a dichotomous treatment variable and a set of pretreatment control variables. Any number or type of dependent variables can be used.

We have three steps:

  • Perform the match with MatchIt::matchit()
  • Create a new data frame with the matched data with MatchIt::match.data()
  • Model

The basic syntax is as follows:

match_process <- MatchIt::matchit(treatment ~ x1 + x2, data = mydata) # NOTE. We include treatment ~ covariates
matched_df <- MatchIt::match.data(match_process)
matched_model <- lm(outcome ~ trearment, data = matched_df)

where treat is the dichotomous treatment variable, and x1 and x2 are pre-treatment co-variates, all of which are contained in the data frame mydata.

As you can see, the outcome variable is not included in the matching procedure.


MatchIt is capable of using several matching methods:

  • Exact (method = "exact"): The simplest version of matching is exact. This technique matches each treated unit to all possible control units with exactly the same values on all the covariates, forming subclasses such that within each subclass all units (treatment and control) have the same covariate values.

  • Subclassification (method = "subclass"): When there are many covariates (or some covariates can take a large number of values), finding sufficient exact matches will often be impossible. The goal of subclassification is to form subclasses, such that in each of them the distribution (rather than the exact values) of covariates for the treated and control groups are as similar as possible.

  • Nearest Neighbor (method = "nearest"): Nearest neighbor matching selects the best control matches for each individual in the treatment group. Matching is done using a distance measure (propensity score) specified by the distance option (default = logit).

  • As well as optimal matching, full matching, genetic matching, and coarsened exact matching, all of which are detailed in the documentation.

A few additional arguments are important to know about:

  • distance: this refers to propensity scores. There are many options for how to calculate these within MatchIt.

  • discard: specifies whether to discard units that fall outside some measure of support of the distance measure (default is "none", discard no units). For example, if some treated units have extremely high propensity scores that are higher than any control units, we could drop those.

  • replace: a logical value indicating whether each control unit can be matched to more than one treated unit (default is replace = FALSE, each control unit is used at most once).

  • ratio: the number of control units to match to each treated unit (default is ratio = 1).

  • There are also some optional arguments for most of the matching methods, which you can read about in the documentation if you are interested.


Exact Matching

We can use a combination of the results from our balance table and theory to identify which variables to use for matching. Let's perform an exact match with:

  • race_white: Is the student white (1) or not (0)?
  • p5hmage: Mother’s age
  • w3income: Family income
  • p5numpla: Number of places the student has lived for at least 4 months
  • w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?
# first we must omit missing values (MatchIt does not allow missings)
match_data <- ecls_df %>% 
  dplyr::select(catholic, c5r2mtsc_std, race_white, p5hmage, 
                w3income, p5numpla, w3momed_hsb) %>% 
  na.omit() 


# perform exact match
exact_match <- MatchIt::matchit(catholic ~ race_white + p5hmage + w3income +
                                p5numpla + w3momed_hsb, 
                                method = "exact", 
                                data = match_data)

# Try seeing the output in the console with summary(exact_match)

# grab the matched data into a new data frame
data_exact_match <- MatchIt::match.data(exact_match)

# estimate effect again with new data frame
exact_match_model <- lm(c5r2mtsc_std ~ catholic, data = data_exact_match)
stargazer::stargazer(exact_match_model, type = "html")
Dependent variable:
c5r2mtsc_std
catholic -0.101***
(0.030)
Constant 0.319***
(0.015)
Observations 5,405
R2 0.002
Adjusted R2 0.002
Residual Std. Error 0.934 (df = 5403)
F Statistic 11.340*** (df = 1; 5403)
Note: p<0.1; p<0.05; p<0.01

Now we can see that the mean in the group that did not attend Catholic school is actually about 0.10 higher than the mean for those who did. The results are statistically significant given that the confidence interval does not contain zero, and we have a fairly small p-value.


Propensity Scores

If we want to perform non-exact matching, we can use propensity scores. We can generate these manually using a logit model on the unmatched data set.

When we extract propensity scores, we model the propensity of each unit of falling under the treatment group based on their values on a set of covariates. This is how we would do this manually:

# create a new column with income by the thousands for more interpretable output
ecls_df <- ecls_df %>% 
  dplyr::mutate(w3income_1k = w3income / 1000) 

# estimate logit model
m_ps <- glm(catholic ~ race_white + w3income_1k + 
            p5hmage + p5numpla + w3momed_hsb,
            family = binomial(link = "logit"), # you can also use a probit link here
            data = ecls_df)

# extract predicted probabilities
# type = "response" option tells R to output probabilities of the form P(Y = 1|X)
prs_df <- dplyr::tibble(pr_score = predict(m_ps, type = "response"),
                     catholic = m_ps$model$catholic) # the actual values

Let's plot the propensity scores by treatment group to explore common support:

# Histogram
ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
  geom_histogram(alpha = 0.5) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(title = "Propensity Score Distribution: Treatment and Control Groups",
       x = "Propensity Score",
       y = "Count",
       fill = "Catholic School Attendance")

# Density plot
ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(title = "Propensity Score Distribution: Treatment and Control Groups",
       x = "Propensity Score",
       y = "Density",
       fill = "Catholic School Attendance") 

# Jittered point plot
ggplot(prs_df, aes(x = pr_score, y = factor(catholic), color = factor(catholic))) +
  geom_jitter() +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(title = "Propensity Score Distribution: Treatment and Control Groups",
       x = "Propensity Score",
       y = "Group",
       color = "Catholic School Attendance") 


Exercise

  • What do these plots tell us?

Non-Exact Matching

MatchIt can generate propensity scores itself, so we don't need to manually go through the process above. Let's try putting together a non-exact matching formula yourself! Try:

  • nearest neighbor matching
  • with replacement
  • with a one-to-one ratio
  • on the match_data dataset
one_match <- MatchIt::matchit(catholic ~ race_white + w3income + p5hmage +
                              p5numpla + w3momed_hsb,
                              method = "nearest", 
                              ratio = 1, 
                              replace = TRUE,
                              data = match_data)

summary(one_match)
## 
## Call:
## MatchIt::matchit(formula = catholic ~ race_white + w3income + 
##     p5hmage + p5numpla + w3momed_hsb, data = match_data, method = "nearest", 
##     ratio = 1, replace = TRUE)
## 
## Summary of balance for all data:
##             Means Treated Means Control SD Control  Mean Diff  eQQ Med
## distance           0.1927        0.1379     0.0845     0.0549 5.67e-02
## race_white         0.7411        0.5914     0.4916     0.1497 0.00e+00
## w3income       82568.9357    55485.0210 43961.0872 27083.9146 2.50e+04
## p5hmage           39.5932       37.5658     6.5506     2.0274 2.00e+00
## p5numpla           1.0917        1.1298     0.3910    -0.0380 0.00e+00
## w3momed_hsb        0.2234        0.4609     0.4985    -0.2375 0.00e+00
##               eQQ Mean  eQQ Max
## distance        0.0548 7.60e-02
## race_white      0.1501 1.00e+00
## w3income    27069.1775 6.25e+04
## p5hmage         2.2544 7.00e+00
## p5numpla        0.0399 2.00e+00
## w3momed_hsb     0.2374 1.00e+00
## 
## 
## Summary of balance for matched data:
##             Means Treated Means Control SD Control Mean Diff eQQ Med  eQQ Mean
## distance           0.1927        0.1927     0.0846    0.0000  0.0042    0.0040
## race_white         0.7411        0.7493     0.4336   -0.0081  0.0000    0.0026
## w3income       82568.9357    81775.6653 46655.1981  793.2703  0.0000 3178.8823
## p5hmage           39.5932       39.6169     5.1757   -0.0237  0.0000    0.1267
## p5numpla           1.0917        1.0777     0.2839    0.0141  0.0000    0.0155
## w3momed_hsb        0.2234        0.2226     0.4162    0.0007  0.0000    0.0095
##              eQQ Max
## distance    1.20e-02
## race_white  1.00e+00
## w3income    6.25e+04
## p5hmage     9.00e+00
## p5numpla    1.00e+00
## w3momed_hsb 1.00e+00
## 
## Percent Balance Improvement:
##             Mean Diff.  eQQ Med eQQ Mean  eQQ Max
## distance       99.9987  92.6038  92.7374  84.1649
## race_white     94.5656   0.0000  98.2776   0.0000
## w3income       97.0711 100.0000  88.2565   0.0000
## p5hmage        98.8326 100.0000  94.3789 -28.5714
## p5numpla       63.0544   0.0000  61.1494  50.0000
## w3momed_hsb    99.6886   0.0000  96.0060   0.0000
## 
## Sample sizes:
##           Control Treated
## All          7915    1352
## Matched      1160    1352
## Unmatched    6755       0
## Discarded       0       0

We can interpret the resulting output as follows:

  • Summary of balance for all data: Comparison of the means for all the data without matching
  • Summary of balance for matched data: Comparison of means for matched data. Looking for them to become similar.
  • Percent balance improvement: Higher is better, close to 100 is ideal.
  • Sample sizes: How many units were matched in the control/treatment groups.

Now, let's plot the propensity scores for the treated and untreated units.

# simple plot - check out the cobalt package for nicer options, or use ggplot2 to create your own!
plot(one_match, type = "hist")

Let's extract the data from one_match and creating a balance table like the one we did before, just this time using the new data. Scroll down for the answer when you're ready.

# grab data set
data_prop_match <- MatchIt::match.data(one_match)

# create list of covariates for the table
list_cov <- c("race_white", "p5hmage", "w3income", "p5numpla", "w3momed_hsb")

data_prop_match %>% # our data frame
  dplyr::summarize_at(list_cov, funs(list(broom::tidy(t.test(. ~ catholic))))) %>% # sequentially run t-tests across all the covariates in the list_cov (note that you have to change the "treatment")
  purrr::map(1) %>% # maps into a list
  dplyr::bind_rows(.id='variables') %>% # binds list into a single data frame and names the id column "variables" 
  dplyr::select(variables, estimate1, estimate2, p.value) %>% # select only the names, group means, and p-values
  dplyr::mutate_if(is.numeric, round, 3) %>% # round numeric variables to three places
  knitr::kable(col.names = c("Variable", "(Catholic = 0)", "(Catholic = 1)", "P value")) %>% # create kable table and remane headings
  kableExtra::kable_styling() # style kable table for our knitted document
Variable (Catholic = 0) (Catholic = 1) P value
race_white 0.743 0.741 0.910
p5hmage 39.521 39.593 0.730
w3income 79313.015 82568.936 0.080
p5numpla 1.078 1.092 0.233
w3momed_hsb 0.233 0.223 0.577

Those means look very close. Hooray.

Finally, let's estimate on the matched data set:

prop_match_model <- lm(c5r2mtsc_std ~ catholic, data = data_prop_match)
stargazer::stargazer(prop_match_model, type = "html")
Dependent variable:
c5r2mtsc_std
catholic -0.164***
(0.037)
Constant 0.374***
(0.027)
Observations 2,512
R2 0.008
Adjusted R2 0.008
Residual Std. Error 0.916 (df = 2510)
F Statistic 20.106*** (df = 1; 2510)
Note: p<0.1; p<0.05; p<0.01

As with the exact matching, we can see that those that did not attend Catholic school performed better on the test than those who did, and the results are statistically significant.

Previous