Matching in R
Welcome
Introduction!
Welcome to our fifth tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.
During this week's lecture you reviewed randomization in experimental setups. You also learned how matching can be leveraged to gather causal estimates.
In this lab session we will:
- Take a step back to review how to compare the means of two groups in R
- Learn how to perform matching with the
MatchIt
package - Illustrate the mechanics of propensity score matching with
gml()
Packages
# These are the libraries we will use today. Make sure to install them in your console in case you have not done so previously.
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(ggplot2) # To create plots (this package is also loaded by library(tidyverse))
library(purrr) # To repeat code across our list in the balance table purrr::map() (this package is also loaded by library(tidyverse))
library(broom) # To format regression output
library(stargazer) # To format model output
library(knitr) # To create HTML tables with kable()
library(kableExtra) # To format the HTML tables
library(MatchIt) # To match
Our data
Today we will work with data from the Early Childhood Longitudinal Study (ECLS).
ecls_df <-
readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/stats-ii-lab/master/Session%204/data/ecls.csv") %>%
dplyr::select(-childid, -race, -w3daded,
-w3momed, -w3inccat) #drop these columns (-)
names(ecls_df) #checking the names of the variables in the data frame
## [1] "catholic" "race_white" "race_black" "race_hispanic"
## [5] "race_asian" "p5numpla" "p5hmage" "p5hdage"
## [9] "w3daded_hsb" "w3momed_hsb" "w3momscr" "w3dadscr"
## [13] "w3income" "w3povrty" "p5fstamp" "c5r2mtsc"
## [17] "c5r2mtsc_std"
(Example inspired by Simon Ejdemyr: https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html)
Reference links:
MatchIt
: https://cran.r-project.org/web/packages/MatchIt/vignettes/matchit.pdfCobalt
: (optional library for matching plots and extra features): https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt_A0_basic_use.htmlkableExtra
: (for formatting tables): https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.htmlStargazer
: (for formatting model output): https://www.jakeruss.com/cheatsheets/stargazer/- Video overview of matching concepts: https://fr.coursera.org/lecture/crash-course-in-causality/overview-of-matching-JQfPC
Comparing Differences in Means and Balance Tables
As you may have seen in this week's application paper, balance tables feature prominently in work that utilizes matching.
In binary treatment setups, balance tables present an overview of the difference in means for the groups accross covariates.
Let's take the dataset as an example to review how to compare differences in means and build balance tables in R. There are multiple ways to explore these kinds of questions. In this lab we will leverage t-tests to check the statistical significance of the difference in means.
In other words, we want to know whether the observed differences in average value of a variable between the two groups or samples can be due to random chance (our null hypothesis).
T-tests in R
In this case, let's imagine we want to check the statistical significance of the differences in the composition of the catholic and public school samples in the w3povrty
(under the poverty line) variable.
The general syntax for a t-test is simply as follows. The vectors refer to the variables whose mean you want to compare.
t.test(y ~ x, data = your_data)
Exercise
t.test(w3povrty ~ catholic, data = ecls_df)
##
## Welch Two Sample t-test
##
## data: w3povrty by catholic
## t = 26.495, df = 4034.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1678107 0.1946307
## sample estimates:
## mean in group 0 mean in group 1
## 0.21918633 0.03796562
- What does the group 0 mean tell us?
- What does the group 1 mean tell us?
- Is the difference between catholic school and public school students statistically significant?
Balance tables in R
Now let's take a look at how can we create a simple and good looking balance table. The idea here is to compare the mean values of different variables across populations or groups. In our case, let's see whether the catholic and public school groups differ across the covariates in the dataset:
Here is one way to create the table (you can adapt this code for the assignment).
# create a list with the covariates
list_cov <- c("race_white", "race_black", "race_hispanic", "race_asian", "p5numpla",
"p5hmage", "p5hdage", "w3daded_hsb", "w3momed_hsb", "w3momscr", "w3dadscr",
"w3income", "w3povrty", "p5fstamp", "c5r2mtsc", "c5r2mtsc_std")
ecls_df %>% # our data frame
dplyr::summarize_at(list_cov, funs(list(broom::tidy(t.test(. ~ catholic))))) %>% # sequentially run t-tests across all the covariates in the list_cov (note that you have to change the "treatment")
purrr::map(1) %>% # maps into a list
dplyr::bind_rows(.id='variables') %>% # binds list into a single data frame and names the id column "variables"
dplyr::select(variables, estimate1, estimate2, p.value) %>% # select only the names, group means, and p-values
dplyr::mutate_if(is.numeric, round, 3) %>% # round numeric variables to three places
knitr::kable(col.names = c("Variable", "(Catholic = 0)", "(Catholic = 1)", "P value")) %>% # create kable table and remane headings
kableExtra::kable_styling() # style kable table for our knitted document
Variable | (Catholic = 0) | (Catholic = 1) | P value |
---|---|---|---|
race_white | 0.556 | 0.725 | 0.000 |
race_black | 0.136 | 0.054 | 0.000 |
race_hispanic | 0.181 | 0.131 | 0.000 |
race_asian | 0.066 | 0.052 | 0.025 |
p5numpla | 1.133 | 1.093 | 0.000 |
p5hmage | 37.561 | 39.575 | 0.000 |
p5hdage | 40.392 | 41.988 | 0.000 |
w3daded_hsb | 0.488 | 0.259 | 0.000 |
w3momed_hsb | 0.464 | 0.227 | 0.000 |
w3momscr | 43.114 | 47.492 | 0.000 |
w3dadscr | 42.713 | 46.356 | 0.000 |
w3income | 54889.159 | 82074.301 | 0.000 |
w3povrty | 0.219 | 0.038 | 0.000 |
p5fstamp | 0.129 | 0.015 | 0.000 |
c5r2mtsc | 50.209 | 52.389 | 0.000 |
c5r2mtsc_std | -0.031 | 0.194 | 0.000 |
Exercise
- Are the differences in means significant at conventional levels?
- What differences strike you from the composition of the two samples?
The Effect of Catholic School on Student Achievement
In this tutorial we’ll analyze the effect of going to Catholic school, as opposed to public school, on student achievement. Because students who attend Catholic school on average are different from students who attend public school, we will use matching to get more credible causal estimates of Catholic schooling.
Exploration of the data
ecls_df %>%
dplyr::group_by(catholic) %>%
dplyr::summarize(n_students = n(),
avg_math = mean(c5r2mtsc_std),
std_error = sd(c5r2mtsc_std) / sqrt(n_students)) %>%
round(3) %>% # round the results
knitr::kable() %>% # create kable table
kableExtra::kable_styling() # view kable table
catholic | n_students | avg_math | std_error |
---|---|---|---|
0 | 9568 | -0.031 | 0.010 |
1 | 1510 | 0.194 | 0.022 |
We can see that we have many more students that did not attend Catholic school than those who did. Also, the Catholic school students have a higher average math score.
Naive Average Treatment Effect (NATE)
We can naively compare the students on their standardized math scores (c5r2mtsc_std). As you remember, the Naive Average Treatment Effect (NATE) is the difference in the means of the observed outcomes of the two groups:
\[NATE = E(y^1 | D = 1) - E(y^0 | D = 0)\]
In this case, the NATE would be difference between the average math scores.
NATE manually
Do you remember what we did in the last section?
We could substract the avg_math
for the catholic = 1 and the avg_math
for the catholic = 0
\[NATE = 0.194 - (-0.031)\] \[NATE = 0.194 + 0.031 = 0.225\]
NATE with t.test()
t.test(c5r2mtsc_std ~ catholic, data = ecls_df)
##
## Welch Two Sample t-test
##
## data: c5r2mtsc_std by catholic
## t = -9.1069, df = 2214.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2727988 -0.1761292
## sample estimates:
## mean in group 0 mean in group 1
## -0.03059583 0.19386817
NATE with lm()
lm(c5r2mtsc_std ~ catholic, data = ecls_df) %>%
stargazer::stargazer(.,type = "html")
Dependent variable: | |
c5r2mtsc_std | |
catholic | 0.224*** |
(0.028) | |
Constant | -0.031*** |
(0.010) | |
Observations | 11,078 |
R2 | 0.006 |
Adjusted R2 | 0.006 |
Residual Std. Error | 0.997 (df = 11076) |
F Statistic | 66.096*** (df = 1; 11076) |
Note: | p<0.1; p<0.05; p<0.01 |
Exercise
- What parallels do you find between substracting the manually extracted means, the t-test, and the linear regression?
Matching with MatchIt
MatchIt
is designed for causal inference with a dichotomous treatment variable and a set of pretreatment control variables. Any number or type of dependent variables can be used.
We have three steps:
- Perform the match with
MatchIt::matchit()
- Create a new data frame with the matched data with
MatchIt::match.data()
- Model
The basic syntax is as follows:
match_process <- MatchIt::matchit(treatment ~ x1 + x2, data = mydata) # NOTE. We include treatment ~ covariates
matched_df <- MatchIt::match.data(match_process)
matched_model <- lm(outcome ~ trearment, data = matched_df)
where treat is the dichotomous treatment variable, and x1 and x2 are pre-treatment co-variates, all of which are contained in the data frame mydata.
MatchIt
is capable of using several matching methods:
Exact (method = "exact"): The simplest version of matching is exact. This technique matches each treated unit to all possible control units with exactly the same values on all the covariates, forming subclasses such that within each subclass all units (treatment and control) have the same covariate values.
Subclassification (method = "subclass"): When there are many covariates (or some covariates can take a large number of values), finding sufficient exact matches will often be impossible. The goal of subclassification is to form subclasses, such that in each of them the distribution (rather than the exact values) of covariates for the treated and control groups are as similar as possible.
Nearest Neighbor (method = "nearest"): Nearest neighbor matching selects the best control matches for each individual in the treatment group. Matching is done using a distance measure (propensity score) specified by the distance option (default = logit).
As well as optimal matching, full matching, genetic matching, and coarsened exact matching, all of which are detailed in the documentation.
A few additional arguments are important to know about:
distance: this refers to propensity scores. There are many options for how to calculate these within MatchIt.
discard: specifies whether to discard units that fall outside some measure of support of the distance measure (default is "none", discard no units). For example, if some treated units have extremely high propensity scores that are higher than any control units, we could drop those.
replace: a logical value indicating whether each control unit can be matched to more than one treated unit (default is replace = FALSE, each control unit is used at most once).
ratio: the number of control units to match to each treated unit (default is ratio = 1).
There are also some optional arguments for most of the matching methods, which you can read about in the documentation if you are interested.
Exact Matching
We can use a combination of the results from our balance table and theory to identify which variables to use for matching. Let's perform an exact match with:
- race_white: Is the student white (1) or not (0)?
- p5hmage: Mother’s age
- w3income: Family income
- p5numpla: Number of places the student has lived for at least 4 months
- w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?
# first we must omit missing values (MatchIt does not allow missings)
match_data <- ecls_df %>%
dplyr::select(catholic, c5r2mtsc_std, race_white, p5hmage,
w3income, p5numpla, w3momed_hsb) %>%
na.omit()
# perform exact match
exact_match <- MatchIt::matchit(catholic ~ race_white + p5hmage + w3income +
p5numpla + w3momed_hsb,
method = "exact",
data = match_data)
# Try seeing the output in the console with summary(exact_match)
# grab the matched data into a new data frame
data_exact_match <- MatchIt::match.data(exact_match)
# estimate effect again with new data frame
exact_match_model <- lm(c5r2mtsc_std ~ catholic, data = data_exact_match)
stargazer::stargazer(exact_match_model, type = "html")
Dependent variable: | |
c5r2mtsc_std | |
catholic | -0.101*** |
(0.030) | |
Constant | 0.319*** |
(0.015) | |
Observations | 5,405 |
R2 | 0.002 |
Adjusted R2 | 0.002 |
Residual Std. Error | 0.934 (df = 5403) |
F Statistic | 11.340*** (df = 1; 5403) |
Note: | p<0.1; p<0.05; p<0.01 |
Now we can see that the mean in the group that did not attend Catholic school is actually about 0.10 higher than the mean for those who did. The results are statistically significant given that the confidence interval does not contain zero, and we have a fairly small p-value.
Propensity Scores
If we want to perform non-exact matching, we can use propensity scores. We can generate these manually using a logit model on the unmatched data set.
When we extract propensity scores, we model the propensity of each unit of falling under the treatment group based on their values on a set of covariates. This is how we would do this manually:
# create a new column with income by the thousands for more interpretable output
ecls_df <- ecls_df %>%
dplyr::mutate(w3income_1k = w3income / 1000)
# estimate logit model
m_ps <- glm(catholic ~ race_white + w3income_1k +
p5hmage + p5numpla + w3momed_hsb,
family = binomial(link = "logit"), # you can also use a probit link here
data = ecls_df)
# extract predicted probabilities
# type = "response" option tells R to output probabilities of the form P(Y = 1|X)
prs_df <- dplyr::tibble(pr_score = predict(m_ps, type = "response"),
catholic = m_ps$model$catholic) # the actual values
Let's plot the propensity scores by treatment group to explore common support:
# Histogram
ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
geom_histogram(alpha = 0.5) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "Propensity Score Distribution: Treatment and Control Groups",
x = "Propensity Score",
y = "Count",
fill = "Catholic School Attendance")
# Density plot
ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
geom_density(alpha = 0.5) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "Propensity Score Distribution: Treatment and Control Groups",
x = "Propensity Score",
y = "Density",
fill = "Catholic School Attendance")
# Jittered point plot
ggplot(prs_df, aes(x = pr_score, y = factor(catholic), color = factor(catholic))) +
geom_jitter() +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "Propensity Score Distribution: Treatment and Control Groups",
x = "Propensity Score",
y = "Group",
color = "Catholic School Attendance")
Exercise
- What do these plots tell us?
Non-Exact Matching
MatchIt
can generate propensity scores itself, so we don't need to manually go through the process above. Let's try putting together a non-exact matching formula yourself! Try:
- nearest neighbor matching
- with replacement
- with a one-to-one ratio
- on the match_data dataset
one_match <- MatchIt::matchit(catholic ~ race_white + w3income + p5hmage +
p5numpla + w3momed_hsb,
method = "nearest",
ratio = 1,
replace = TRUE,
data = match_data)
summary(one_match)
##
## Call:
## MatchIt::matchit(formula = catholic ~ race_white + w3income +
## p5hmage + p5numpla + w3momed_hsb, data = match_data, method = "nearest",
## ratio = 1, replace = TRUE)
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.1927 0.1379 0.0845 0.0549 5.67e-02
## race_white 0.7411 0.5914 0.4916 0.1497 0.00e+00
## w3income 82568.9357 55485.0210 43961.0872 27083.9146 2.50e+04
## p5hmage 39.5932 37.5658 6.5506 2.0274 2.00e+00
## p5numpla 1.0917 1.1298 0.3910 -0.0380 0.00e+00
## w3momed_hsb 0.2234 0.4609 0.4985 -0.2375 0.00e+00
## eQQ Mean eQQ Max
## distance 0.0548 7.60e-02
## race_white 0.1501 1.00e+00
## w3income 27069.1775 6.25e+04
## p5hmage 2.2544 7.00e+00
## p5numpla 0.0399 2.00e+00
## w3momed_hsb 0.2374 1.00e+00
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
## distance 0.1927 0.1927 0.0846 0.0000 0.0042 0.0040
## race_white 0.7411 0.7493 0.4336 -0.0081 0.0000 0.0026
## w3income 82568.9357 81775.6653 46655.1981 793.2703 0.0000 3178.8823
## p5hmage 39.5932 39.6169 5.1757 -0.0237 0.0000 0.1267
## p5numpla 1.0917 1.0777 0.2839 0.0141 0.0000 0.0155
## w3momed_hsb 0.2234 0.2226 0.4162 0.0007 0.0000 0.0095
## eQQ Max
## distance 1.20e-02
## race_white 1.00e+00
## w3income 6.25e+04
## p5hmage 9.00e+00
## p5numpla 1.00e+00
## w3momed_hsb 1.00e+00
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 99.9987 92.6038 92.7374 84.1649
## race_white 94.5656 0.0000 98.2776 0.0000
## w3income 97.0711 100.0000 88.2565 0.0000
## p5hmage 98.8326 100.0000 94.3789 -28.5714
## p5numpla 63.0544 0.0000 61.1494 50.0000
## w3momed_hsb 99.6886 0.0000 96.0060 0.0000
##
## Sample sizes:
## Control Treated
## All 7915 1352
## Matched 1160 1352
## Unmatched 6755 0
## Discarded 0 0
We can interpret the resulting output as follows:
- Summary of balance for all data: Comparison of the means for all the data without matching
- Summary of balance for matched data: Comparison of means for matched data. Looking for them to become similar.
- Percent balance improvement: Higher is better, close to 100 is ideal.
- Sample sizes: How many units were matched in the control/treatment groups.
Now, let's plot the propensity scores for the treated and untreated units.
# simple plot - check out the cobalt package for nicer options, or use ggplot2 to create your own!
plot(one_match, type = "hist")
Let's extract the data from one_match and creating a balance table like the one we did before, just this time using the new data. Scroll down for the answer when you're ready.
# grab data set
data_prop_match <- MatchIt::match.data(one_match)
# create list of covariates for the table
list_cov <- c("race_white", "p5hmage", "w3income", "p5numpla", "w3momed_hsb")
data_prop_match %>% # our data frame
dplyr::summarize_at(list_cov, funs(list(broom::tidy(t.test(. ~ catholic))))) %>% # sequentially run t-tests across all the covariates in the list_cov (note that you have to change the "treatment")
purrr::map(1) %>% # maps into a list
dplyr::bind_rows(.id='variables') %>% # binds list into a single data frame and names the id column "variables"
dplyr::select(variables, estimate1, estimate2, p.value) %>% # select only the names, group means, and p-values
dplyr::mutate_if(is.numeric, round, 3) %>% # round numeric variables to three places
knitr::kable(col.names = c("Variable", "(Catholic = 0)", "(Catholic = 1)", "P value")) %>% # create kable table and remane headings
kableExtra::kable_styling() # style kable table for our knitted document
Variable | (Catholic = 0) | (Catholic = 1) | P value |
---|---|---|---|
race_white | 0.743 | 0.741 | 0.910 |
p5hmage | 39.521 | 39.593 | 0.730 |
w3income | 79313.015 | 82568.936 | 0.080 |
p5numpla | 1.078 | 1.092 | 0.233 |
w3momed_hsb | 0.233 | 0.223 | 0.577 |
Those means look very close. Hooray.
Finally, let's estimate on the matched data set:
prop_match_model <- lm(c5r2mtsc_std ~ catholic, data = data_prop_match)
stargazer::stargazer(prop_match_model, type = "html")
Dependent variable: | |
c5r2mtsc_std | |
catholic | -0.164*** |
(0.037) | |
Constant | 0.374*** |
(0.027) | |
Observations | 2,512 |
R2 | 0.008 |
Adjusted R2 | 0.008 |
Residual Std. Error | 0.916 (df = 2510) |
F Statistic | 20.106*** (df = 1; 2510) |
Note: | p<0.1; p<0.05; p<0.01 |
As with the exact matching, we can see that those that did not attend Catholic school performed better on the test than those who did, and the results are statistically significant.