Optimizing the Matching Process with a Random Search Algorithm

Practical Example: Optimizing the Matching Process

Matching observations based on generalized propensity scores involves tuning multiple hyperparameters. Running separate workflows with different parameter combinations can be tedious, and the effects of some parameters are not always predictable. To streamline this process, vecmatch provides an automated optimization workflow using a random search algorithm. The function optimize_gps() is implemented with multithreading to leverage computational resources efficiently.

Step 1: Define the Formula, Data, and Optimization Space

In this example, we use the built-in cancer dataset and focus on two predictors: the categorical variable sex and the continuous variable age. We first specify the model formula. Note that data and formula are fixed inputs; if you want to compare different formulas, you must run the workflow separately for each.

Next, we define the search space for the hyperparameters. The helper function make_opt_args() validates your inputs and computes the Cartesian product of all specified values, reporting the total number of parameter combinations.

opt_args <- make_opt_args(
  data            = cancer,
  formula         = formula_cancer,
  reference       = c("control", "adenoma", "crc_beningn", "crc_malignant"),
  gps_method      = c("m1", "m7", "m8"),
  matching_method = c("fullopt", "nnm"),
  caliper         = seq(0.01, 5, 0.01),
  cluster         = 1:3,
  ratio           = 1:3,
  min_controls    = 1:3,
  max_controls    = 1:3
)

opt_args
#> Optimization Argument Set (class: opt_args)
#> ---------------------------------------- 
#> gps_method      : m1, m7, m8
#> reference       : control, adenoma, crc_beningn, crc_malignant
#> matching_method : fullopt, nnm
#> caliper         : [500 values]
#> order           : desc, asc, original, random
#> cluster         : 1, 2, 3
#> replace         : TRUE, FALSE
#> ties            : TRUE, FALSE
#> ratio           : 1, 2, 3
#> min_controls    : 1, 2, 3
#> max_controls    : 1, 2, 3
#> ---------------------------------------- 
#> Total combinations: 1512000

The print() method for make_opt_args() provides a clear summary of the search space, including each hyperparameter’s values and the total number of combinations.

Step 2: Run the Optimizer

With the search space defined, we can launch the optimization. The optimize_gps() function performs a random search across the parameter grid and returns a summary table containing key quality metrics for each tested combination. You control the number of iterations via the n_iter argument.

The function uses multithreading (via the future package) to parallelize work. As a guideline, aim for at least 1000–1500 iterations per core for reliable search coverage. Monitor your system’s memory usage, since the parallel backend can consume substantial RAM.

The function automatically registers a multisession backend and, after computing and matching the GPS with foreach and doRNG, it cleans up the parallel workers to free memory. In a future release, we plan to allow users to select alternative backends, such as an external compute cluster, for greater flexibility.

By default, optimize_gps() preserves the global random seed. For reproducibility, set a seed before calling the optimizer.

set.seed(167894)
seed_before <- .Random.seed

opt_results <- optimize_gps(
  data     = cancer,
  formula  = formula_cancer,
  opt_args = opt_args,
  n_iter   = 1500
)
#> Best Optimization Results by SMD Group
#> ======================================
#> 
#> ------------------------------------------------------- 
#> | smd_group  | unique_configs |  smd   | perc_matched |
#> ------------------------------------------------------- 
#> |0-0.05      |               1|   0.038|         79.07|
#> |0.05-0.10   |               1|   0.086|          79.4|
#> |0.10-0.15   |               1|   0.145|          91.2|
#> |0.15-0.20   |               2|   0.191|         95.68|
#> |0.20-0.25   |               1|   0.213|         93.19|
#> |0.25-0.30   |               1|    0.25|         73.39|
#> |>0.30       |               1|   0.359|         62.81|
#> ------------------------------------------------------- 
#> 
#> Optimization Summary
#> --------------------
#> Total combinations tested  : 1500
#> Total optimization time [s]: 91.09

We ran the optimization on a single core with n_iter = 1500; on our test machine this required 91.09 seconds. Given the size of the parameter grid, increasing n_iter would improve the search’s coverage, but here we limited iterations to keep the vignette’s build time reasonable.

When you print opt_results, it summarizes the entire search by grouping parameter sets into bins defined by their maximum standardized mean difference (SMD), and within each bin it highlights the combination that achieves the highest proportion of matched observations.

Step 3: Select Optimal Configurations

After optimization, select_opt() helps you choose parameter combinations that meet your specific objectives. For example, you may wish to maximize the number of matched samples for certain treatment groups while minimizing imbalance on key covariates.

In this example, we aim to: * Retain the largest possible number of observations in the adenoma and crc_malignant groups. * Minimize the standardized mean difference (SMD) for the age variable.

We can achieve this by specifying arguments in select_opt():

select_results <- select_opt(opt_results,
  perc_matched = c("adenoma", "crc_malignant"),
  smd_variables = "age",
  smd_type = "max"
)
#> 
#> Optimization Selection Summary
#> ====================
#> 
#> ------------------------------------------------------- 
#> | smd_group  | unique_configs |  smd   | perc_matched |
#> ------------------------------------------------------- 
#> |0-0.05      |               1|   0.043|       162.957|
#> |0.05-0.10   |               1|   0.099|       176.715|
#> |0.10-0.15   |               1|   0.145|       189.474|
#> |0.15-0.20   |               2|   0.191|       194.611|
#> |0.20-0.25   |               1|   0.213|       190.179|
#> |0.25-0.30   |               1|    0.25|       148.874|
#> |0.30-0.35   |               1|   0.321|        99.426|
#> |0.35-0.40   |               1|   0.356|        94.214|
#> |0.40-0.45   |               1|   0.429|       113.729|
#> |0.45-0.50   |               1|   0.475|       111.038|
#> |>0.50       |               1|   0.551|       127.861|
#> -------------------------------------------------------

The output shows the SMD bins and highlights the combination within each bin that best meets our criteria. Suppose the configuration in the 0.10–0.15 SMD bin offers a desirable balance of matched samples; we can extract its parameters for manual refitting.

Step 4: Refit the Optimized Model Manually

To inspect the matched dataset and detailed balance summaries, we rerun the standard vecmatch workflow using the selected parameters.

  1. Extract the parameter data frame and filter the chosen SMD bin:
param_df <- attr(select_results, "param_df")

subset(param_df, smd_group == "0.10-0.15")
#>   iter_ID  gps_model method_match caliper order kmeans_cluster replace ties
#> 6   ID565 estimate_4          nnm    4.29   asc              1    TRUE TRUE
#>   ratio min_controls max_controls reference p_control p_adenoma.x p_crc_beningn
#> 6     2           NA           NA   adenoma  87.22045         100      85.23985
#>   p_crc_malignant.x method_gps              link      age overall_stat
#> 6          89.47368   multinom generalized_logit 0.144529     0.144529
#>   smd_group p_adenoma.y p_crc_malignant.y
#> 6 0.10-0.15         100          89.47368
  1. Use these values to call estimate_gps() and match_gps(). For example:
# estimating gps
gps_mat <- estimate_gps(formula_cancer,
  cancer,
  method = "multinom",
  link = "generalized_logit",
  reference = "adenoma"
)
# csr with refitting
csr_mat <- csregion(gps_mat)
#> 
#> Rectangular CSR Borders Evaluation 
#> ==================================
#> 
#> Treatment       | Lower CSR limit | Upper CSR limit | Number excluded 
#> -------------------------------------------------------------------- 
#> adenoma         | 0.2185785       | 0.3541997       | 10              
#> control         | 0.1710161       | 0.3226619       | 12              
#> crc_beningn     | 0.1766004       | 0.2903609       | 13              
#> crc_malignant   | 0.1384517       | 0.2641614       | 11              
#> 
#> ===================================================
#> The total number of excluded observations is:     20 
#> Note: You can view the summary of the CSR calculation using the  `attr()` function.

# matching
matched_df <- match_gps(csr_mat,
  method = "nnm",
  caliper = 4.29,
  reference = "adenoma",
  ratio = 2,
  order = "asc",
  replace = TRUE,
  ties = TRUE,
  kmeans_cluster = 1
)
  1. Evaluate balance and verify reproducibility:
balqual(matched_df,
  formula = formula_cancer,
  type = "smd",
  statistic = "max",
  round = 3,
  cutoffs = 0.2
)
#> 
#> Matching Quality Evaluation
#> ================================================================================ 
#> 
#> Count table for the treatment variable:
#> -------------------------------------------------- 
#> Treatment                 | Before     | After      
#> -------------------------------------------------- 
#> adenoma                   | 373        | 373        
#> control                   | 313        | 273        
#> crc_beningn               | 271        | 231        
#> crc_malignant             | 247        | 221        
#> -------------------------------------------------- 
#> 
#> 
#> Matching summary statistics:
#> ---------------------------------------- 
#> Total n before matching:  1204 
#> Total n after matching:       1098 
#> % of matched observations:    91.20 %
#> Total  maximal   SMD value:   0.145 
#> 
#> 
#> Maximal values :
#> -------------------------------------------------------------------------------- 
#> Variable                  | Coef  | Before       | After        | Quality      
#> -------------------------------------------------------------------------------- 
#> age                       | SMD   | 0.215        | 0.145        | Balanced     
#> sexF                      | SMD   | 0.148        | 0.117        | Balanced     
#> sexM                      | SMD   | 0.148        | 0.117        | Balanced     
#> age:sexF                  | SMD   | 0.155        | 0.122        | Balanced     
#> age:sexM                  | SMD   | 0.159        | 0.136        | Balanced     
#> --------------------------------------------------------------------------------

all.equal(seed_before, .Random.seed)
#> [1] TRUE