Using PanelMatch

Overview

The goal of this vignette is to provide a quick overview of using the PanelMatch package, highlight important features and functions, and help users get the most out of their experience with the package. It assumes that you have been able to install the package successfully and are already familiar with the basics of R.

We will be working with the dem data set throughout these examples, which comes included with the package. This is a subset of the data used in Acemoglu et. al’s “Democracy Does Cause Growth” (2019).

Assumptions and Requirements

The package is designed to implement a set of methodological tools that enable researchers to apply matching methods to time-series cross-sectional data, so it is assumed that the data set being used for analysis meets this general structure. Additionally, the treatment variable must be binary, where 0 means “assignment” to control, and 1 to treatment. We also impose the requirements that the variable identifying units in the data are numeric or integer, and that the variable identifying time periods are consecutive numeric/integer data. The data must also be provided in the form of a standard data.frame object (as opposed to a data.table, tibble or other such object). The package will signal an error if these assumptions are violated.

DisplayTreatment

Before doing any matching, estimation or further analysis, it is helpful to understand the distribution of the treatment variable within your data set. The package hopes to facilitate this with the DisplayTreatment function.

library(PanelMatch)
DisplayTreatment(unit.id = "wbcode2",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "dem", data = dem)

In the plot, the x axis represents time, and the y axis displays the different units in your data set. Red tiles indicate periods where “treatment” is applied to a given unit and blue tiles indicate “control” periods. White spaces indicate missing data. In the above plot, we have a large number of unique units and time periods, so the axes become hard to read. The DisplayTreatment function uses ggplot2 to create this plot, so any custom styling can be applied easily using ggplot2 conventions, as the function returns a ggplot2 object. However, the function also has some built in options to make cleaning up the plot a little easier. For instance, when the data set is particularly large, it can help to use the dense.plot option. There are many ways to customize these plots. Consult the documentation for the full list and descriptions of the arguments.

DisplayTreatment(unit.id = "wbcode2",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "dem", data = dem, 
                 hide.x.axis.label = TRUE, hide.y.axis.label = TRUE) # axis label options

DisplayTreatment(unit.id = "wbcode2",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "dem", data = dem, 
                 hide.x.axis.label = TRUE, hide.y.axis.label = TRUE, 
                 dense.plot = TRUE) # setting dense.plot to TRUE

PanelMatch

Next, we will move to the PanelMatch function. The primary purposes of this function are to 1) use the treatment histories of units to create sets matching treated units to control units and 2) refine the matched sets by determining weights for each control unit in a given matched set. These weights are then used in the estimation stage in an intuitive way: units with higher weights factor more heavily into the estimations. There are a number of parameters to the PanelMatch function, some of which will not be discussed in this vignette. Please consult the function documentation for a complete list.

Matching on Treatment History

1 is achieved by matching units that receive treatment after previously being untreated (ie. units that move from control, or 0 in the data, to treatment, or 1 in the data, at a certain time) to control units that have matching treatment histories in a specified time window, while also remaining untreated during the same period that the treated unit receives treatment. For example, if unit 4 is a control unit until the year 1992, when it then receives treatment, then, for a specified lag window of four time periods (specified as lag = 4), it will be matched with control units that share an identical treatment history with unit 4 from 1988-1991, while also remaining control units in 1992.

Setting the Quantity of Interest

This process of matching on treatment history is also affected by the specified quantity of interest, which is specified using the qoi parameter, set to either “att”, “atc”, or “ate”. See Imai, Kim, and Wang (2018) http://web.mit.edu/insong/www/pdf/tscs.pdf for full descriptions. When the qoi is set to “att”, control units have 0 in the treatment variable column, and treated units have 1. The process is flipped when the qoi is set to atc. When ate is specified as the qoi, sets are computed for both the att and atc setups.

Visualize Treatment History Matching

We can use the DisplayTreatment function to help better understand this definition and the example given above. In the code below, we will create matched sets as described in 1, extract one of these sets, and then show the treated unit and matched control units from this set on a plot. We will cover the PanelMatch function in more detail later, so do not worry about understanding all of the arguments yet. For now, know that the lag window is specified using the lag variable and that we are not refining the matched sets in any other way, aside from the description in 1 (refinement.method = “none”).

The visualization lines up with our expectations. We see that the control units and the treated unit have identical treatment histories over the lag window (1988-1991). Unit 4 then receives treatment in 1992 and the other units shown do not.

# Create the matched sets
PM.results.none <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "none", 
                         data = dem, match.missing = TRUE, 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)
# Extract the first matched set
mset <- PM.results.none$att[1]

# Use the DisplayTreatment function to visualize the 
# treated unit and matched controls.
DisplayTreatment(unit.id = "wbcode2",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "dem", data = dem,
                 matched.set = mset, # this way we highlight the particular set
                 show.set.only = TRUE)
#> Warning: Vectorized input to `element_text()` is not officially supported.
#> ℹ Results may be unexpected or may change in future versions of ggplot2.

Refining Matched Sets

In order to refine the matched sets further and assign weights to control units, the user must specify 1) a method for calculating similarity/distance between units and 2) the variables to be used in similarity/distance calculations to compare units. Other parameters do affect the refinement process, and the user should consult the documentation for a more complete list and set of descriptions.

Select a Refinement Method

Users must choose a refinement method by setting the refinement.method argument to one of “mahalanobis”, “ps.match”, “CBPS.match”, “ps.weight”, “CBPS.weight”, “ps.msm.weight”, “CBPS.msm.weight”, or “none”. The “matching” (any method that has “match” in the name) and mahalanobis refinement methods will assign equal weights to the size.match most similar control units in a matched set. The “weighting” methods (methods with “weight” in the name) will generate weights in such a way that control units more similar to treated units will be assigned higher weights. Please consult the documentation and/or Imai, Kim, and Wang (2018) http://web.mit.edu/insong/www/pdf/tscs.pdf for descriptions of each refinement method.

Variable Selection

Users must also define which covariates should be used in this process for defining similarity between units. This is set using the covs.formula argument, which takes the form of a one sided formula object. The variables defined on the right hand side of the formula are the variables used in these calculations. Users can included “lagged” versions of variables using I(lag(name.of.var, 0:n)).

We will now turn to understanding how to work with the PanelMatch function and the objects it returns to execute and tune these processes.

Understanding PanelMatch and matched.set objects

The PanelMatch function returns a PanelMatch object. These objects contain a number of elements, the most important of which is the matched.set object. Within the PanelMatch object, the attached matched.set object is always named either att or atc. It should be noted that when qoi = ate, then there are two matched.set objects included in the results of the PanelMatch call. Specifically, there will be two matched sets, named att and atc, respectively.

In implementation, the matched.set is just a named list with some added attributes (lag, names of treatment, unit, and time variables) and a structured name scheme. Each entry in the list corresponds to a matched set of treated and control units. Specifically, the names of the elements correspond to the unit and time identifiers of the treated units. This naming scheme is structured, using the pattern [id varable].[time variable]. Each element in the list is a vector indicating the control units (as a vector of the unit ids) that are matched with the treated unit specified in the name of that element.

Recall earlier we used a very basic PanelMatch call that did no unit refinement other than matching units on treatment history. We create an additional PanelMatch call, specifying that we want to use Mahalanobis distance and lagged versions of the tradewb and y variables to calculate the distances between units. Because this is a matching method, we will only give weights to the size.match most similar control units to each treated unit as determined by the distance calculations.

#Call PanelMatch without any refinement
PM.results.none <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "none", 
                         data = dem, match.missing = TRUE, 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)
#Extract the matched.set object
msets.none <- PM.results.none$att

#Call PanelMatch with refinement
PM.results.maha <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "mahalanobis", 
                         # use Mahalanobis distance 
                         data = dem, match.missing = TRUE, 
                         covs.formula = ~ tradewb, 
                         size.match = 5, qoi = "att" , outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)

msets.maha <- PM.results.maha$att

Examining matched.set objects

Control Unit Weights

Each matched set will have an attribute named “weights.” This vector indicates the weights calculated for each control unit in the refinement process. Each of the possible refinement methods behave differently. We continue with our running examples to observe this. When the refinement.method is set to none, all control units will receive equal weights. Using the Mahalanobis distance refinement method will identify the size.match most similar units to each treated unit. These units then receive a weight, while all other control units receive a weight of zero. We can use R’s standard interface for working with object attributes to examine the weights. The non-zero weights assigned to control units will always sum to one in a given matched set.

# Examine the weights assinged to control units in first matched set
attr(msets.none[[1]], "weights")
#>          3         13         16         19         28         29         31 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>         35         36         37         43         45         47         51 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>         53         57         62         64         65         67         70 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>         71         81         84         87         93         95         96 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>         97        103        104        105        109        110        112 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>        114        115        116        118        123        124        125 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>        128        129        134        140        142        150        155 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>        156        157        159        161        163        168        171 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>        172        173        175        176        178        179        180 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>        182        184        186        187        190        193        196 
#> 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 0.01351351 
#>        197        199        200        202 
#> 0.01351351 0.01351351 0.01351351 0.01351351
attr(msets.maha[[1]], "weights")
#>   3  13  16  19  28  29  31  35  36  37  43  45  47  51  53  57  62  64  65  67 
#> 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
#>  70  71  81  84  87  93  95  96  97 103 104 105 109 110 112 114 115 116 118 123 
#> 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
#> 124 125 128 129 134 140 142 150 155 156 157 159 161 163 168 171 172 173 175 176 
#> 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
#> 178 179 180 182 184 186 187 190 193 196 197 199 200 202 
#> 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2 0.2 0.0 0.0

Comparing Methods of Refinement

While we have mostly looked at a simple case of refinement using Mahalanobis distance so far, there are a number of ways to tune the refinement process. There are no hard and fast rules for determining the best configuration. Instead, users should use their substantive knowledge to experiment with and evaluate a number of different setups. In general, users should consider 1) the number of matched sets 2) the number of controls matched to each treated unit and 3) covariate balance as they configure PanelMatch. Having a large number of small matched sets will create larger standard errors in the estimation stage. Poorly balanced covariates suggest undesirable comparisons between treated and control units. Users ought to consider the method of refinement, the variables used for calculating weights, the size of the lag window, the desired procedures for handling missing data (See the match.missing and listwise.delete arguments in the PanelMatch function documentation), and the maximum size of matched sets (when using a matching method). The package aims to provide a number of featres to help users with this process. First, the print, plot, and summary methods for matched.set objects will help users study 1 and 2. To help evaluate covariate balance, users can take advantage of the get_covariate_balance function.

Using get_covariate_balance

The following code runs a number of different PanelMatch refinement configurations and calculates the covariate balance for each setup. Lower values in the covariate balance calculations are desirable. See the documentation of get_covariate_balance for more detailed information.

PM.results.none <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "none", 
                         data = dem, match.missing = TRUE, 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)


PM.results.maha <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "mahalanobis", 
                         data = dem, match.missing = TRUE, 
                         covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)

# listwise deletion used for missing data
PM.results.listwise <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "mahalanobis", 
                         data = dem, match.missing = FALSE, listwise.delete = TRUE, 
                         covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)

# propensity score based weighting method
PM.results.ps.weight <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "ps.weight", 
                         data = dem, match.missing = FALSE, listwise.delete = TRUE, 
                         covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE)

get_covariate_balance(PM.results.none$att,
                      data = dem,
                      covariates = c("tradewb", "y"),
                      plot = FALSE)
#>         tradewb            y
#> t_4 -0.07245466  0.291871990
#> t_3 -0.20930129  0.208654876
#> t_2 -0.24425207  0.107736647
#> t_1 -0.10806125 -0.004950238
#> t_0 -0.09493854 -0.015198483

get_covariate_balance(PM.results.maha$att,
                      data = dem,
                      covariates = c("tradewb", "y"),
                      plot = FALSE)
#>         tradewb          y
#> t_4  0.04558637 0.09701606
#> t_3 -0.03312750 0.10844046
#> t_2 -0.01396793 0.08890753
#> t_1  0.10474894 0.06618865
#> t_0  0.15885415 0.05691437

get_covariate_balance(PM.results.listwise$att,
                      data = dem,
                      covariates = c("tradewb", "y"), 
                      plot = FALSE)
#>         tradewb          y
#> t_4  0.05634922 0.05223623
#> t_3 -0.01104797 0.05217896
#> t_2  0.01411473 0.03094133
#> t_1  0.06850180 0.02092209
#> t_0  0.05044958 0.01943728

get_covariate_balance(PM.results.ps.weight$att,
                      data = dem,
                      covariates = c("tradewb", "y"), 
                      plot = FALSE)
#>         tradewb          y
#> t_4 0.014362590 0.04035905
#> t_3 0.005529734 0.04188731
#> t_2 0.009410044 0.04195008
#> t_1 0.027907540 0.03975173
#> t_0 0.040272235 0.04167921

The get_covariate_balance function has a few useful options. Users can generate plots showing the covariate balance by setting plot = TRUE. These plots can be customized using the same arguments one would normally pass to the base R plot method. Users can also set use.equal.weights = TRUE to easily get the balance of unrefined sets. This makes it easy to understand the impact of refinement.

# Use equal weights
get_covariate_balance(PM.results.ps.weight$att,
                      data = dem,
                      use.equal.weights = TRUE,
                      covariates = c("tradewb", "y"), 
                      plot = TRUE, 
                      # visualize by setting plot to TRUE
                      ylim = c(-1, 1))


# Compare covariate balance to refined sets
get_covariate_balance(PM.results.ps.weight$att,
                      data = dem,
                      covariates = c("tradewb", "y"), 
                      plot = TRUE, 
                      # visualize by setting plot to TRUE
                      ylim = c(-1, 1))

We can also evaluate our results using the balance_scatter function:

balance_scatter(non_refined_set = PM.results.none$att,
               refined_list = list(PM.results.maha$att, PM.results.ps.weight$att),
               data = dem,
               covariates = c("y", "tradewb"))

PanelEstimate

We now move to the other major part of the package: obtaining point estimates and standard errors using PanelEstimate. The package uses bootstrapping to obtain standard errors. By default, 1000 bootstrap iterations are used, and .95 is the default confidence level.

The function returns a PanelEstimate object, which behaves like a list, much like the other objects in the package. As such, you can access the various elements just as you would in a list.

PE.results <- PanelEstimate(sets = PM.results.ps.weight, data = dem) 

# View the point estimates 
PE.results[["estimates"]]
#>       t+0       t+1       t+2       t+3       t+4 
#> 0.2609565 0.9630847 1.2851017 1.7370930 1.4871846

PanelEstimate Methods

PanelEsimate objects have custom summary and plot methods defined. The plot method can be customized with all of the same arguments/operations as the regular plot function in base R. This method shows the point estimates for the specified lead window periods, along with the standard errors.

summary(PE.results)
#> Weighted Difference-in-Differences with Propensity Score
#> Matches created with 4 lags
#> 
#> Standard errors computed with 1000 Weighted bootstrap samples
#> 
#> Estimate of Average Treatment Effect on the Treated (ATT) by Period:
#> $summary
#>      estimate std.error      2.5%    97.5%
#> t+0 0.2609565 0.6366316 -1.005048 1.404036
#> t+1 0.9630847 1.0140831 -1.265015 2.844695
#> t+2 1.2851017 1.3767780 -1.679806 3.795274
#> t+3 1.7370930 1.7350874 -1.840462 5.127225
#> t+4 1.4871846 2.1188012 -2.970983 5.559820
#> 
#> $lag
#> [1] 4
#> 
#> $iterations
#> [1] 1000
#> 
#> $qoi
#> [1] "att"

plot(PE.results)

This last plot makes it clear that the effect of treatment on treated units (att) in this configuration is statistically insignificant.

Moderating Variables

The PanelEstimate function can handle moderating variables. When a moderating variable is specified in the PanelEstimate function, a list of PanelEstimate objects are returned, each with the structure of a standard PanelEstimate object. There will be a separate PanelEstimate object in the list for each value of the moderating variable, and the name of each element in the returned list will reflect this value. The moderating variable is assumed to be categorical.

# add simple moderating variable
dem$moderator <- 0
dem$moderator <- ifelse(dem$wbcode2 > 100, 1, 2)


PM.results <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "mahalanobis", 
                         data = dem, match.missing = TRUE, 
                         covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), # lags
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)

PE.results <- PanelEstimate(sets = PM.results, data = dem, moderator = "moderator")

# Names correspond to 
# moderator values
names(PE.results)
#> [1] "X2" "X1"

# Extract each result
plot(PE.results[[1]])

plot(PE.results[[2]])