---
title: "Using PanelMatch"
author: "In Song Kim, Adam Rauh, Erik Wang, Kosuke Imai"
date: "`r Sys.Date()`"
output:
  html_document:
    df_print: paged
  pdf_document:
    number_sections: no
  fig_width: 6 
  fig_height: 4 
vignette: |
  %\VignetteIndexEntry{Using PanelMatch} 
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---
# Overview
The goal of this vignette is to provide a quick overview of using the `PanelMatch` package, highlight important features and functions, and help users get the most out of their experience with the package. It assumes that you have been able to install the package successfully and are already familiar with the basics of R.

We will be working with the `dem` data set throughout these examples, which comes included with the package. This is a subset of the data used in Acemoglu et. al's "Democracy Does Cause Growth" (2019).

## Assumptions and Requirements
The package is designed to implement a set of methodological tools that enable researchers to apply matching methods to time-series cross-sectional data, so it is assumed that the data set being used for analysis meets this general structure. Additionally, the treatment variable must be binary, where 0 means "assignment" to control, and 1 to treatment. We also impose the requirements that the variable identifying units in the data are numeric or integer, and that the variable identifying time periods are consecutive numeric/integer data. The data must also be provided in the form of a standard `data.frame` object (as opposed to a `data.table`, `tibble` or other such object). The package will signal an error if these assumptions are violated.


```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# DisplayTreatment

Before doing any matching, estimation or further analysis, it is helpful to understand the distribution of the treatment variable within your data set. The package hopes to facilitate this with the `DisplayTreatment` function. 

```{r}
library(PanelMatch)
DisplayTreatment(unit.id = "wbcode2",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "dem", data = dem)
```

In the plot, the x axis represents time, and the y axis displays the different units in your data set. Red tiles indicate periods where "treatment" is applied to a given unit and blue tiles indicate "control" periods. White spaces indicate missing data. In the above plot, we have a large number of unique units and time periods, so the axes become hard to read. The `DisplayTreatment` function uses `ggplot2` to create this plot, so any custom styling can be applied easily using `ggplot2` conventions, as the function returns a `ggplot2` object. However, the function also has some built in options to make cleaning up the plot a little easier. For instance, when the data set is particularly large, it can help to use the `dense.plot` option. There are many ways to customize these plots. Consult the documentation for the full list and descriptions of the arguments. 


```{r}
DisplayTreatment(unit.id = "wbcode2",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "dem", data = dem, 
                 hide.x.axis.label = TRUE, hide.y.axis.label = TRUE) # axis label options
```

```{r}
DisplayTreatment(unit.id = "wbcode2",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "dem", data = dem, 
                 hide.x.axis.label = TRUE, hide.y.axis.label = TRUE, 
                 dense.plot = TRUE) # setting dense.plot to TRUE
```

# PanelMatch

Next, we will move to the `PanelMatch` function. The primary purposes of this function are to 1) use the treatment histories of units to create sets matching treated units to control units and 2) refine the matched sets by determining weights for each control unit in a given matched set. These weights are then used in the estimation stage in an intuitive way: units with higher weights factor more heavily into the estimations. There are a number of parameters to the `PanelMatch` function, some of which will not be discussed in this vignette. Please consult the function documentation for a complete list. 

### Matching on Treatment History

**1** is achieved by matching units that receive treatment after previously being untreated (ie. units that move from `control`, or 0 in the data, to `treatment`, or 1 in the data, at a certain time) to control units that have matching treatment histories in a specified time window, while also remaining untreated during the same period that the treated unit receives treatment. For example, if unit 4 is a control unit until the year 1992, when it then receives treatment, then, for a specified lag window of four time periods (specified as `lag = 4`), it will be matched with control units that share an identical treatment history with unit 4 from 1988-1991, while also *remaining control units* in 1992. 

#### Setting the Quantity of Interest

This process of matching on treatment history is also affected by the specified quantity of interest, which is specified using the `qoi` parameter, set to either "att", "atc", or "ate". See Imai, Kim, and Wang (2018) <http://web.mit.edu/insong/www/pdf/tscs.pdf> for full descriptions. When the `qoi` is set to "att", control units have `0` in the treatment variable column, and treated units have `1`. The process is flipped when the `qoi` is set to `atc`. When `ate` is specified as the `qoi`, sets are computed for both the `att` and `atc` setups. 

### Visualize Treatment History Matching

We can use the `DisplayTreatment` function to help better understand this definition and the example given above. In the code below, we will create matched sets as described in **1**, extract one of these sets, and then show the treated unit and matched control units from this set on a plot. We will cover the `PanelMatch` function in more detail later, so do not worry about understanding all of the arguments yet. For now, know that the lag window is specified using the `lag` variable and that we are not refining the matched sets in any other way, aside from the description in **1** (`refinement.method` = "none").

The visualization lines up with our expectations. We see that the control units and the treated unit have identical treatment histories over the lag window (1988-1991). Unit 4 then receives treatment in 1992 and the other units shown do not. 

```{r}
# Create the matched sets
PM.results.none <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "none", 
                         data = dem, match.missing = TRUE, 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)
# Extract the first matched set
mset <- PM.results.none$att[1]

# Use the DisplayTreatment function to visualize the 
# treated unit and matched controls.
DisplayTreatment(unit.id = "wbcode2",
                 time.id = "year", legend.position = "none",
                 xlab = "year", ylab = "Country Code",
                 treatment = "dem", data = dem,
                 matched.set = mset, # this way we highlight the particular set
                 show.set.only = TRUE)
```

### Refining Matched Sets
In order to refine the matched sets further and assign weights to control units, the user must specify 1) a method for calculating similarity/distance between units and 2) the variables to be used in similarity/distance calculations to compare units. Other parameters do affect the refinement process, and the user should consult the documentation for a more complete list and set of descriptions. 

#### Select a Refinement Method
Users must choose a refinement method by setting the `refinement.method` argument to one of "mahalanobis", "ps.match", "CBPS.match", "ps.weight", "CBPS.weight", "ps.msm.weight", "CBPS.msm.weight", or "none". The "matching" (any method that has "match" in the name) and mahalanobis refinement methods will assign equal weights to the `size.match` most similar control units in a matched set. The "weighting" methods (methods with "weight" in the name) will generate weights in such a way that control units more similar to treated units will be assigned higher weights. Please consult the documentation and/or Imai, Kim, and Wang (2018) <http://web.mit.edu/insong/www/pdf/tscs.pdf> for descriptions of each refinement method.

#### Variable Selection
Users must also define which covariates should be used in this process for defining similarity between units. This is set using the `covs.formula` argument, which takes the form of a one sided formula object. The variables defined on the right hand side of the formula are the variables used in these calculations. Users can included "lagged" versions of variables using `I(lag(name.of.var, 0:n))`.

We will now turn to understanding how to work with the `PanelMatch` function and the objects it returns to execute and tune these processes.

## Understanding `PanelMatch` and `matched.set` objects

The `PanelMatch` function returns a `PanelMatch` object. These objects contain a number of elements, the most important of which is the `matched.set` object. Within the `PanelMatch` object, the attached `matched.set` object is always named either `att` or `atc`. It should be noted that when `qoi = ate`, then there are two `matched.set` objects included in the results of the `PanelMatch` call. Specifically, there will be two matched sets, named `att` and `atc`, respectively.

In implementation, the `matched.set` is just a named list with some added attributes (lag, names of treatment, unit, and time variables) and a structured name scheme. Each entry in the list corresponds to a matched set of treated and control units. Specifically, the names of the elements correspond to the unit and time identifiers of the treated units. This naming scheme is structured, using the pattern `[id varable]`.`[time variable]`. Each element in the list is a vector indicating the control units (as a vector of the unit ids) that are matched with the treated unit specified in the name of that element.

Recall earlier we used a very basic `PanelMatch` call that did no unit refinement other than matching units on treatment history. We create an additional `PanelMatch` call, specifying that we want to use Mahalanobis distance and lagged versions of the `tradewb` and `y` variables to calculate the distances between units. Because this is a matching method, we will only give weights to the `size.match` most similar control units to each treated unit as determined by the distance calculations. 

```{r}
#Call PanelMatch without any refinement
PM.results.none <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "none", 
                         data = dem, match.missing = TRUE, 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)
#Extract the matched.set object
msets.none <- PM.results.none$att

#Call PanelMatch with refinement
PM.results.maha <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "mahalanobis", 
                         # use Mahalanobis distance 
                         data = dem, match.missing = TRUE, 
                         covs.formula = ~ tradewb, 
                         size.match = 5, qoi = "att" , outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)

msets.maha <- PM.results.maha$att
```

### Examining `matched.set` objects

### `print`, `summary`, and `plot`

Methods for printing, summarizing, and plotting `matched.set` objects are implemented.  We can print basic information about the matched sets using the `print` method. The output will show information about the treated units and the number of control units in each matched set. *Note that this does not consider the weights of control units within each set. The matched set sizes shown in the output of `print.matched.set` will show the number of controls matched to a treated unit based on treatment history only.* Matched set objects are implemented as lists, but the default printing behavior resembles that of a data frame. One can toggle a `verbose` option on the `print` method to print as a list and also display a less summarized version of the matched set data.

Despite the refinement applied to the `msets.maha` object, the matched set sizes appear identical because the weights of control units are not considered in this view -- just treatment history. Users are able to view more information about these objects with the `summary` function.

```{r}
print(msets.none)
print(msets.maha)

summary(msets.none)
summary(msets.maha)
```

This information can also be shown graphically. Using the `plot` method, users can visualize the distribution of the size of matched sets. By default, a red line will indicate the number of matched sets where treated units were unable to be matched with any control units (eg. the number of empty matched sets). These plots can be adjusted using the arguments described in the documentation, or any other arguments normally passed to `graphics::plot`.

```{r}
plot(msets.none)
```


#### Subsetting `matched.set` objects
Since `matched.set` objects are just lists with attributes, you can expect the `[` and `[[` functions to work similarly to how they would with a list. So, for instance, users can extract information about matched sets using numerical indices or by taking advantage of the naming scheme. 

```{r}
msets.maha[1] #prints like a matched.set object
msets.maha[[1]] #prints the matched control unit ids.
msets.maha[["4.1992"]] # same behavior as above
```


### Control Unit Weights

Each matched set will have an attribute named "weights." This vector indicates the weights calculated for each control unit in the refinement process. Each of the possible refinement methods behave differently. We continue with our running examples to observe this. When the `refinement.method` is set to `none`, all control units will receive equal weights. Using the Mahalanobis distance refinement method will identify the `size.match` most similar units to each treated unit. These units then receive a weight, while all other control units receive a weight of zero. We can use R's standard interface for working with object attributes to examine the weights. The non-zero weights assigned to control units will always sum to one in a given matched set.


```{r}
# Examine the weights assinged to control units in first matched set
attr(msets.none[[1]], "weights")
attr(msets.maha[[1]], "weights")
```

### Comparing Methods of Refinement

While we have mostly looked at a simple case of refinement using Mahalanobis distance so far, there are a number of ways to tune the refinement process. There are no hard and fast rules for determining the best configuration. Instead, users should use their substantive knowledge to experiment with and evaluate a number of different setups. In general, users should consider 1) the number of matched sets 2) the number of controls matched to each treated unit and 3) covariate balance as they configure `PanelMatch`. Having a large number of small matched sets will create larger standard errors in the estimation stage. Poorly balanced covariates suggest undesirable comparisons between treated and control units. Users ought to consider the method of refinement, the variables used for calculating weights, the size of the lag window, the desired procedures for handling missing data (See the `match.missing` and `listwise.delete` arguments in the `PanelMatch` function documentation), and the maximum size of matched sets (when using a matching method). The package aims to provide a number of featres to help users with this process. First, the `print`, `plot`, and `summary` methods for `matched.set` objects will help users study **1** and **2**. To help evaluate covariate balance, users can take advantage of the `get_covariate_balance` function. 

#### Using `get_covariate_balance`

The following code runs a number of different `PanelMatch` refinement configurations and calculates the covariate balance for each setup. Lower values in the covariate balance calculations are desirable. See the documentation of `get_covariate_balance` for more detailed information. 

```{r}
PM.results.none <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "none", 
                         data = dem, match.missing = TRUE, 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)


PM.results.maha <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "mahalanobis", 
                         data = dem, match.missing = TRUE, 
                         covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)

# listwise deletion used for missing data
PM.results.listwise <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "mahalanobis", 
                         data = dem, match.missing = FALSE, listwise.delete = TRUE, 
                         covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)

# propensity score based weighting method
PM.results.ps.weight <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "ps.weight", 
                         data = dem, match.missing = FALSE, listwise.delete = TRUE, 
                         covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), 
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE)

get_covariate_balance(PM.results.none$att,
                      data = dem,
                      covariates = c("tradewb", "y"),
                      plot = FALSE)

get_covariate_balance(PM.results.maha$att,
                      data = dem,
                      covariates = c("tradewb", "y"),
                      plot = FALSE)

get_covariate_balance(PM.results.listwise$att,
                      data = dem,
                      covariates = c("tradewb", "y"), 
                      plot = FALSE)

get_covariate_balance(PM.results.ps.weight$att,
                      data = dem,
                      covariates = c("tradewb", "y"), 
                      plot = FALSE)


```

The `get_covariate_balance` function has a few useful options. Users can generate plots showing the covariate balance by setting `plot = TRUE`. These plots can be customized using the same arguments one would normally pass to the base R `plot` method. Users can also set `use.equal.weights = TRUE` to easily get the balance of unrefined sets. This makes it easy to understand the impact of refinement.  


```{r}
# Use equal weights
get_covariate_balance(PM.results.ps.weight$att,
                      data = dem,
                      use.equal.weights = TRUE,
                      covariates = c("tradewb", "y"), 
                      plot = TRUE, 
                      # visualize by setting plot to TRUE
                      ylim = c(-1, 1))

# Compare covariate balance to refined sets
get_covariate_balance(PM.results.ps.weight$att,
                      data = dem,
                      covariates = c("tradewb", "y"), 
                      plot = TRUE, 
                      # visualize by setting plot to TRUE
                      ylim = c(-1, 1))
```

We can also evaluate our results using the `balance_scatter` function:

```{r}
balance_scatter(non_refined_set = PM.results.none$att,
               refined_list = list(PM.results.maha$att, PM.results.ps.weight$att),
               data = dem,
               covariates = c("y", "tradewb"))
```


# PanelEstimate

We now move to the other major part of the package: obtaining point estimates and standard errors using `PanelEstimate`. The package uses bootstrapping to obtain standard errors. By default, 1000 bootstrap iterations are used, and .95 is the default confidence level. 

The function returns a `PanelEstimate` object, which behaves like a list, much like the other objects in the package. As such, you can access the various elements just as you would in a list.

```{r}
PE.results <- PanelEstimate(sets = PM.results.ps.weight, data = dem) 

# View the point estimates 
PE.results[["estimates"]]

```

## PanelEstimate Methods

`PanelEsimate` objects have custom `summary` and `plot` methods defined. The `plot` method can be customized with all of the same arguments/operations as the regular `plot` function in base R. This method shows the point estimates for the specified lead window periods, along with the standard errors.

```{r}
summary(PE.results)

plot(PE.results)
```

This last plot makes it clear that the effect of treatment on treated units (att) in this configuration is statistically insignificant.

## Moderating Variables

The `PanelEstimate` function can handle moderating variables. When a moderating variable is specified in the `PanelEstimate` function, a list of `PanelEstimate` objects are returned, each with the structure of a standard `PanelEstimate` object. There will be a separate `PanelEstimate` object in the list for each value of the moderating variable, and the name of each element in the returned list will reflect this value. The moderating variable is assumed to be categorical.

```{r}
# add simple moderating variable
dem$moderator <- 0
dem$moderator <- ifelse(dem$wbcode2 > 100, 1, 2)


PM.results <- PanelMatch(lag = 4, time.id = "year", unit.id = "wbcode2", 
                         treatment = "dem", refinement.method = "mahalanobis", 
                         data = dem, match.missing = TRUE, 
                         covs.formula = ~ I(lag(tradewb, 1:4)) + I(lag(y, 1:4)), # lags
                         size.match = 5, qoi = "att", outcome.var = "y",
                         lead = 0:4, forbid.treatment.reversal = FALSE, 
                         use.diagonal.variance.matrix = TRUE)

PE.results <- PanelEstimate(sets = PM.results, data = dem, moderator = "moderator")

# Names correspond to 
# moderator values
names(PE.results)

# Extract each result
plot(PE.results[[1]])
plot(PE.results[[2]])
```