Matching observations based on generalized propensity scores involves
tuning multiple hyperparameters. Running separate workflows with
different parameter combinations can be tedious, and the effects of some
parameters are not always predictable. To streamline this process,
vecmatch provides an automated optimization workflow using a random
search algorithm. The function optimize_gps()
is
implemented with multithreading to leverage computational resources
efficiently.
Step 1: Define the Formula, Data, and Optimization Space
In this example, we use the built-in cancer
dataset and
focus on two predictors: the categorical variable sex
and
the continuous variable age
. We first specify the model
formula. Note that data
and formula
are fixed
inputs; if you want to compare different formulas, you must run the
workflow separately for each.
Next, we define the search space for the hyperparameters. The helper
function make_opt_args()
validates your inputs and computes
the Cartesian product of all specified values, reporting the total
number of parameter combinations.
opt_args <- make_opt_args(
data = cancer,
formula = formula_cancer,
reference = c("control", "adenoma", "crc_beningn", "crc_malignant"),
gps_method = c("m1", "m7", "m8"),
matching_method = c("fullopt", "nnm"),
caliper = seq(0.01, 5, 0.01),
cluster = 1:3,
ratio = 1:3,
min_controls = 1:3,
max_controls = 1:3
)
opt_args
#> Optimization Argument Set (class: opt_args)
#> ----------------------------------------
#> gps_method : m1, m7, m8
#> reference : control, adenoma, crc_beningn, crc_malignant
#> matching_method : fullopt, nnm
#> caliper : [500 values]
#> order : desc, asc, original, random
#> cluster : 1, 2, 3
#> replace : TRUE, FALSE
#> ties : TRUE, FALSE
#> ratio : 1, 2, 3
#> min_controls : 1, 2, 3
#> max_controls : 1, 2, 3
#> ----------------------------------------
#> Total combinations: 1512000
The print()
method for make_opt_args()
provides a clear summary of the search space, including each
hyperparameter’s values and the total number of combinations.
With the search space defined, we can launch the optimization. The
optimize_gps()
function performs a random search across the
parameter grid and returns a summary table containing key quality
metrics for each tested combination. You control the number of
iterations via the n_iter
argument.
The function uses multithreading (via the future
package) to parallelize work. As a guideline, aim for at least 1000–1500
iterations per core for reliable search coverage. Monitor your system’s
memory usage, since the parallel backend can consume substantial
RAM.
The function automatically registers a multisession backend and,
after computing and matching the GPS with foreach
and
doRNG
, it cleans up the parallel workers to free memory. In
a future release, we plan to allow users to select alternative backends,
such as an external compute cluster, for greater flexibility.
By default, optimize_gps()
preserves the global random
seed. For reproducibility, set a seed before calling the optimizer.
set.seed(167894)
seed_before <- .Random.seed
opt_results <- optimize_gps(
data = cancer,
formula = formula_cancer,
opt_args = opt_args,
n_iter = 1500
)
#> Best Optimization Results by SMD Group
#> ======================================
#>
#> -------------------------------------------------------
#> | smd_group | unique_configs | smd | perc_matched |
#> -------------------------------------------------------
#> |0-0.05 | 1| 0.038| 79.07|
#> |0.05-0.10 | 1| 0.086| 79.4|
#> |0.10-0.15 | 1| 0.145| 91.2|
#> |0.15-0.20 | 2| 0.191| 95.68|
#> |0.20-0.25 | 1| 0.213| 93.19|
#> |0.25-0.30 | 1| 0.25| 73.39|
#> |>0.30 | 1| 0.359| 62.81|
#> -------------------------------------------------------
#>
#> Optimization Summary
#> --------------------
#> Total combinations tested : 1500
#> Total optimization time [s]: 91.09
We ran the optimization on a single core with
n_iter = 1500
; on our test machine this required 91.09
seconds. Given the size of the parameter grid, increasing
n_iter
would improve the search’s coverage, but here we
limited iterations to keep the vignette’s build time reasonable.
When you print opt_results
, it summarizes the entire
search by grouping parameter sets into bins defined by their maximum
standardized mean difference (SMD), and within each bin it highlights
the combination that achieves the highest proportion of matched
observations.
After optimization, select_opt()
helps you choose
parameter combinations that meet your specific objectives. For example,
you may wish to maximize the number of matched samples for certain
treatment groups while minimizing imbalance on key covariates.
In this example, we aim to: * Retain the largest possible number of
observations in the adenoma
and crc_malignant
groups. * Minimize the standardized mean difference (SMD) for the
age
variable.
We can achieve this by specifying arguments in
select_opt()
:
select_results <- select_opt(opt_results,
perc_matched = c("adenoma", "crc_malignant"),
smd_variables = "age",
smd_type = "max"
)
#>
#> Optimization Selection Summary
#> ====================
#>
#> -------------------------------------------------------
#> | smd_group | unique_configs | smd | perc_matched |
#> -------------------------------------------------------
#> |0-0.05 | 1| 0.043| 162.957|
#> |0.05-0.10 | 1| 0.099| 176.715|
#> |0.10-0.15 | 1| 0.145| 189.474|
#> |0.15-0.20 | 2| 0.191| 194.611|
#> |0.20-0.25 | 1| 0.213| 190.179|
#> |0.25-0.30 | 1| 0.25| 148.874|
#> |0.30-0.35 | 1| 0.321| 99.426|
#> |0.35-0.40 | 1| 0.356| 94.214|
#> |0.40-0.45 | 1| 0.429| 113.729|
#> |0.45-0.50 | 1| 0.475| 111.038|
#> |>0.50 | 1| 0.551| 127.861|
#> -------------------------------------------------------
The output shows the SMD bins and highlights the combination within
each bin that best meets our criteria. Suppose the configuration in the
0.10–0.15
SMD bin offers a desirable balance of matched
samples; we can extract its parameters for manual refitting.
To inspect the matched dataset and detailed balance summaries, we
rerun the standard vecmatch
workflow using the selected
parameters.
param_df <- attr(select_results, "param_df")
subset(param_df, smd_group == "0.10-0.15")
#> iter_ID gps_model method_match caliper order kmeans_cluster replace ties
#> 6 ID565 estimate_4 nnm 4.29 asc 1 TRUE TRUE
#> ratio min_controls max_controls reference p_control p_adenoma.x p_crc_beningn
#> 6 2 NA NA adenoma 87.22045 100 85.23985
#> p_crc_malignant.x method_gps link age overall_stat
#> 6 89.47368 multinom generalized_logit 0.144529 0.144529
#> smd_group p_adenoma.y p_crc_malignant.y
#> 6 0.10-0.15 100 89.47368
estimate_gps()
and
match_gps()
. For example:# estimating gps
gps_mat <- estimate_gps(formula_cancer,
cancer,
method = "multinom",
link = "generalized_logit",
reference = "adenoma"
)
# csr with refitting
csr_mat <- csregion(gps_mat)
#>
#> Rectangular CSR Borders Evaluation
#> ==================================
#>
#> Treatment | Lower CSR limit | Upper CSR limit | Number excluded
#> --------------------------------------------------------------------
#> adenoma | 0.2185785 | 0.3541997 | 10
#> control | 0.1710161 | 0.3226619 | 12
#> crc_beningn | 0.1766004 | 0.2903609 | 13
#> crc_malignant | 0.1384517 | 0.2641614 | 11
#>
#> ===================================================
#> The total number of excluded observations is: 20
#> Note: You can view the summary of the CSR calculation using the `attr()` function.
# matching
matched_df <- match_gps(csr_mat,
method = "nnm",
caliper = 4.29,
reference = "adenoma",
ratio = 2,
order = "asc",
replace = TRUE,
ties = TRUE,
kmeans_cluster = 1
)
balqual(matched_df,
formula = formula_cancer,
type = "smd",
statistic = "max",
round = 3,
cutoffs = 0.2
)
#>
#> Matching Quality Evaluation
#> ================================================================================
#>
#> Count table for the treatment variable:
#> --------------------------------------------------
#> Treatment | Before | After
#> --------------------------------------------------
#> adenoma | 373 | 373
#> control | 313 | 273
#> crc_beningn | 271 | 231
#> crc_malignant | 247 | 221
#> --------------------------------------------------
#>
#>
#> Matching summary statistics:
#> ----------------------------------------
#> Total n before matching: 1204
#> Total n after matching: 1098
#> % of matched observations: 91.20 %
#> Total maximal SMD value: 0.145
#>
#>
#> Maximal values :
#> --------------------------------------------------------------------------------
#> Variable | Coef | Before | After | Quality
#> --------------------------------------------------------------------------------
#> age | SMD | 0.215 | 0.145 | Balanced
#> sexF | SMD | 0.148 | 0.117 | Balanced
#> sexM | SMD | 0.148 | 0.117 | Balanced
#> age:sexF | SMD | 0.155 | 0.122 | Balanced
#> age:sexM | SMD | 0.159 | 0.136 | Balanced
#> --------------------------------------------------------------------------------
all.equal(seed_before, .Random.seed)
#> [1] TRUE