This document explains in detail the convergence criteria and error monitoring.
For basic usage of the missForestPredict package, read the "Using the missForestPredict package" vignette.
missForestPredict imputes each variable in a dataset in an iterative
fashion, using an adapted version of the missForest algorithm (Stekhoven
and Bühlmann 2012). By default, convergence is calculated based on the
out-of-bag (OOB) error, but the apparent error can be used too. At each
iteration the OOB error is calculated for each variable separately. To
obtain a global error, a weighted average of the OOB errors of all
variables is used; the weights can be controlled with the var_weights
parameter. By default, the weight of each variable in the convergence
criterion is set to the proportion of missing values in that variable.
The normalized mean square error (NMSE) is used for both continuous and categorical variables. For continuous variables, it is equivalent to \(1 - R^2\). For categorical variables, it is equivalent to \(1 - BSS\) (Brier Skill Score).
Continuous variables:
\(NMSE = \frac{\sum_{i=1}^{N}(x_i - \hat{x_i})^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} = 1 - R^2\), \(i = 1, 2, ... N\)
\({x_i}\) = the true value of variable x for observation i
\(\bar{x}\) = the mean value of variable x
\(\hat{x_i}\) = prediction (imputation) of variable x for observation i
\(N\) = number of observations
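The formula above can be checked on a handful of numbers. The following is a minimal sketch in base R; the helper `nmse_continuous` and the toy vectors are hypothetical and are not part of missForestPredict.

```r
# Hypothetical helper (not part of missForestPredict):
# NMSE for a continuous variable, following the formula above.
nmse_continuous <- function(x, x_hat) {
  sum((x - x_hat)^2) / sum((x - mean(x))^2)
}

x     <- c(5.1, 4.9, 6.3, 5.8, 7.1)  # toy true values
x_hat <- c(5.0, 5.2, 6.0, 6.1, 6.8)  # toy predictions (imputations)

nmse_value <- nmse_continuous(x, x_hat)
r_squared  <- 1 - nmse_value

# Predicting the mean for every observation gives NMSE = 1 (R^2 = 0)
nmse_mean_only <- nmse_continuous(x, rep(mean(x), length(x)))
```

An NMSE below 1 therefore means the predictions do better than simply using the variable's mean.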
Categorical variables (including ordinal factors):
\(NMSE = \frac{BS}{BSref} = 1 - BSS\)
\(BS =\frac{1}{N}\sum_{j=1}^{R}\sum_{i=1}^{N}(p_{ij} - x_{ij})^2\), \(i = 1, 2, ... N\), \(j = 1, 2, ... R\)
\(BSref =\frac{1}{N}\sum_{j=1}^{R}\sum_{i=1}^{N}(p_{j} - x_{ij})^2=1-\sum_{j=1}^{R}p_j^2\)
\({x_{ij}}\) = the true value of variable x for observation i and class j (1 if observation i has class j and 0 otherwise)
\(p_{ij}\) = prediction (probability) for observation i and class j
\(p_{j}\) = proportion of the event in class j
\(N\) = number of observations
\(R\) = number of classes
The Brier Score (\(BS\)) is calculated as the sum over all classes of the squared distances between the predictions (as probabilities) and the true values (0 or 1), averaged over the observations. The reference Brier Score (\(BSref\)) is calculated as the Brier Score of a predictor that predicts the proportion of the event in each class (Brier et al. 1950).
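The categorical NMSE can be illustrated in the same way. This is a sketch in base R under stated assumptions: the helper `nmse_categorical`, the toy factor and the probability matrix are hypothetical, and the columns of the probability matrix are assumed to follow the order of the factor levels.

```r
# Hypothetical helper (not part of missForestPredict):
# BS, BSref and NMSE = BS / BSref for a categorical variable.
# x: factor of true classes; p: N x R matrix of predicted probabilities,
# with columns in the order of levels(x).
nmse_categorical <- function(x, p) {
  # Indicator matrix x_ij: 1 if observation i has class j, 0 otherwise
  x_ind <- sapply(levels(x), function(l) as.numeric(x == l))
  # Class proportions p_j
  p_ref <- colMeans(x_ind)
  bs     <- sum((p - x_ind)^2) / length(x)
  bs_ref <- sum(sweep(x_ind, 2, p_ref)^2) / length(x)
  c(BS = bs, BSref = bs_ref, NMSE = bs / bs_ref)
}

x <- factor(c("a", "a", "b", "c", "b", "a"))
p <- matrix(c(0.8, 0.1, 0.1,
              0.6, 0.3, 0.1,
              0.2, 0.7, 0.1,
              0.1, 0.2, 0.7,
              0.3, 0.6, 0.1,
              0.7, 0.2, 0.1), ncol = 3, byrow = TRUE)

res <- nmse_categorical(x, p)

# BSref agrees with the closed form 1 - sum(p_j^2)
bs_ref_closed_form <- 1 - sum((as.numeric(table(x)) / length(x))^2)
```

As with continuous variables, predicting the class proportions for every observation yields an NMSE of 1.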
At each iteration, the MSE (mean square error) and NMSE (normalized
mean square error) are saved and reported for each variable if
verbose = TRUE.
Post imputation, the error can be checked using the evaluate_imputation_error function.
Please note that in all following examples we set the ranger parameter
num.threads to 2. If you are running the code on a machine with more
cores and are willing to use them, you can remove this parameter
completely.
Iris data: The iris dataset in base R contains four continuous variables and one categorical variable with three categories for N = 150 flowers (Anderson 1935).
Diamonds data: The diamonds dataset from the ggplot2 R package contains seven continuous variables and three categorical variables for N = 53940 diamonds (Wickham 2016).
The MSE and NMSE errors are returned at the end of the imputation as part of the return object.
library(missForestPredict)
data(iris)
set.seed(2022)
iris_mis <- produce_NA(iris, proportion = 0.5)
set.seed(2022)
missForest_object <- missForestPredict::missForest(iris_mis, verbose = TRUE, num.threads = 2)
#> Imputation sequence (missing proportion): Sepal.Length (0.5) Sepal.Width (0.5) Petal.Length (0.5) Petal.Width (0.5) Species (0.5)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.2631490086, 0.1190566363, 0.2970492214, 0.0408396181, 0.0448929621
#> OOB errors NMSE: 0.369143355, 0.7511649273, 0.1023088421, 0.0703699959, 0.2027753574
#> diff. convergence measure: 0.7008475044
#> time: 0.333 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.0716818361, 0.1060503075, 0.1422825025, 0.0317743183, 0.0343722204
#> OOB errors NMSE: 0.1005547146, 0.6691039997, 0.049004532, 0.0547497443, 0.155254609
#> diff. convergence measure: 0.0934189756
#> time: 0.295 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.0657252116, 0.0758944172, 0.0814812491, 0.0275581206, 0.0339381295
#> OOB errors NMSE: 0.0921988087, 0.4788412152, 0.0280635385, 0.0474848915, 0.1532938797
#> diff. convergence measure: 0.0457570532
#> time: 0.277 seconds
#>
#> missForest iteration 4 in progress...done!
#> OOB errors MSE: 0.0743593329, 0.0735555198, 0.0829895296, 0.0295329478, 0.0315113268
#> OOB errors NMSE: 0.1043106861, 0.4640843921, 0.0285830161, 0.05088768, 0.1423323447
#> diff. convergence measure: 0.0019368429
#> time: 0.275 seconds
#>
#> missForest iteration 5 in progress...done!
#> OOB errors MSE: 0.0712202983, 0.0742891292, 0.0795972937, 0.0264270381, 0.032471418
#> OOB errors NMSE: 0.099907273, 0.4687129593, 0.0274146719, 0.0455359441, 0.146668945
#> diff. convergence measure: 0.0003916652
#> time: 0.284 seconds
#>
#> missForest iteration 6 in progress...done!
#> OOB errors MSE: 0.0737972482, 0.0804233789, 0.0794809263, 0.0292423785, 0.0239687394
#> OOB errors NMSE: 0.1035221981, 0.5074158271, 0.027374593, 0.0503870055, 0.1082635111
#> diff. convergence measure: -0.0017446683
#> time: 0.262 seconds
print(missForest_object$OOB_err)
#>    iteration     variable        MSE       NMSE        MER  macro_F1 F1_score
#>  1         0 Sepal.Length         NA 1.00000000         NA        NA       NA
#>  2         0  Sepal.Width         NA 1.00000000         NA        NA       NA
#>  3         0 Petal.Length         NA 1.00000000         NA        NA       NA
#>  4         0  Petal.Width         NA 1.00000000         NA        NA       NA
#>  5         0      Species         NA 1.00000000         NA        NA       NA
#>  6         1 Sepal.Length 0.26314901 0.36914336         NA        NA       NA
#>  7         1  Sepal.Width 0.11905664 0.75116493         NA        NA       NA
#>  8         1 Petal.Length 0.29704922 0.10230884         NA        NA       NA
#>  9         1  Petal.Width 0.04083962 0.07037000         NA        NA       NA
#> 10         1      Species 0.04489296 0.20277536 0.08000000 0.9245014       NA
#> 11         2 Sepal.Length 0.07168184 0.10055471         NA        NA       NA
#> 12         2  Sepal.Width 0.10605031 0.66910400         NA        NA       NA
#> 13         2 Petal.Length 0.14228250 0.04900453         NA        NA       NA
#> 14         2  Petal.Width 0.03177432 0.05474974         NA        NA       NA
#> 15         2      Species 0.03437222 0.15525461 0.08000000 0.9226845       NA
#> 16         3 Sepal.Length 0.06572521 0.09219881         NA        NA       NA
#> 17         3  Sepal.Width 0.07589442 0.47884122         NA        NA       NA
#> 18         3 Petal.Length 0.08148125 0.02806354         NA        NA       NA
#> 19         3  Petal.Width 0.02755812 0.04748489         NA        NA       NA
#> 20         3      Species 0.03393813 0.15329388 0.08000000 0.9228758       NA
#> 21         4 Sepal.Length 0.07435933 0.10431069         NA        NA       NA
#> 22         4  Sepal.Width 0.07355552 0.46408439         NA        NA       NA
#> 23         4 Petal.Length 0.08298953 0.02858302         NA        NA       NA
#> 24         4  Petal.Width 0.02953295 0.05088768         NA        NA       NA
#> 25         4      Species 0.03151133 0.14233234 0.08000000 0.9228758       NA
#> 26         5 Sepal.Length 0.07122030 0.09990727         NA        NA       NA
#> 27         5  Sepal.Width 0.07428913 0.46871296         NA        NA       NA
#> 28         5 Petal.Length 0.07959729 0.02741467         NA        NA       NA
#> 29         5  Petal.Width 0.02642704 0.04553594         NA        NA       NA
#> 30         5      Species 0.03247142 0.14666894 0.06666667 0.9355050       NA
#> 31         6 Sepal.Length 0.07379725 0.10352220         NA        NA       NA
#> 32         6  Sepal.Width 0.08042338 0.50741583         NA        NA       NA
#> 33         6 Petal.Length 0.07948093 0.02737459         NA        NA       NA
#> 34         6  Petal.Width 0.02924238 0.05038701         NA        NA       NA
#> 35         6      Species 0.02396874 0.10826351 0.06666667 0.9355050       NA
#> 36         7 Sepal.Length         NA         NA         NA        NA       NA
#> 37         7  Sepal.Width         NA         NA         NA        NA       NA
#> 38         7 Petal.Length         NA         NA         NA        NA       NA
#> 39         7  Petal.Width         NA         NA         NA        NA       NA
#> 40         7      Species         NA         NA         NA        NA       NA
#> 41         8 Sepal.Length         NA         NA         NA        NA       NA
#> 42         8  Sepal.Width         NA         NA         NA        NA       NA
#> 43         8 Petal.Length         NA         NA         NA        NA       NA
#> 44         8  Petal.Width         NA         NA         NA        NA       NA
#> 45         8      Species         NA         NA         NA        NA       NA
#> 46         9 Sepal.Length         NA         NA         NA        NA       NA
#> 47         9  Sepal.Width         NA         NA         NA        NA       NA
#> 48         9 Petal.Length         NA         NA         NA        NA       NA
#> 49         9  Petal.Width         NA         NA         NA        NA       NA
#> 50         9      Species         NA         NA         NA        NA       NA
#> 51        10 Sepal.Length         NA         NA         NA        NA       NA
#> 52        10  Sepal.Width         NA         NA         NA        NA       NA
#> 53        10 Petal.Length         NA         NA         NA        NA       NA
#> 54        10  Petal.Width         NA         NA         NA        NA       NA
#> 55        10      Species         NA         NA         NA        NA       NA
These errors can also be plotted when visual inspection is easier to
interpret. We will plot them using the ggplot2 package.
library(dplyr)
library(tidyr)
library(ggplot2)
# Plot the OOB NMSE of each variable over the iterations
missForest_object$OOB_err %>%
  filter(!is.na(NMSE)) %>%
  ggplot(aes(iteration, NMSE, col = variable)) +
  geom_point() +
  geom_line()
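The "diff. convergence measure" printed in the verbose output of this run can be reproduced by hand from the reported NMSE values. The sketch below assumes that the measure is the weighted average NMSE of the previous iteration minus that of the current iteration, with iteration 0 set to an NMSE of 1 for every variable (as in the OOB_err table). That assumption is consistent with the printed numbers but is not taken from the package source. The NMSE values are copied from the output; all object names are illustrative.

```r
# NMSE per variable, copied from the verbose output of the iris run
nmse <- rbind(
  iter0 = c(1, 1, 1, 1, 1),
  iter1 = c(0.369143355, 0.7511649273, 0.1023088421, 0.0703699959, 0.2027753574),
  iter2 = c(0.1005547146, 0.6691039997, 0.049004532, 0.0547497443, 0.155254609)
)
colnames(nmse) <- c("Sepal.Length", "Sepal.Width", "Petal.Length",
                    "Petal.Width", "Species")

# Default weights: proportion of missing values per variable (0.5 for all here)
w <- rep(0.5, ncol(nmse))

# Weighted average NMSE per iteration
global_nmse <- as.vector(nmse %*% w) / sum(w)

# Difference between consecutive iterations (previous minus current)
conv_diff <- -diff(global_nmse)
print(round(conv_diff, 8))
```

The two differences agree with the printed values for iterations 1 and 2 (0.7008475044 and 0.0934189756) up to the rounding of the copied NMSE values.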
The convergence of the algorithm is based on a weighted average of the
OOB NMSE of each variable. By default, the weights are proportional to
the proportion of missing values in each variable. There can be
situations when this is not the optimal choice. The weights for each
variable can be adjusted via the var_weights parameter. In the following
example we will create a different proportion of missing values for each
variable and adjust the weights to be equal.
We will create missing values only on the last two variables
(Petal.Width and Species) and first run the imputation with the default
setting. Keep in mind that missForestPredict builds models for all
variables in the dataset, regardless of the missingness rate. Models
will also be built for variables that are complete. Their weight will be
zero in the convergence criterion, but imputation models will still be
stored for these variables and can later be used on new observations; if
missing values unexpectedly occur in your test set, they will be imputed
using these learned models.
library(missForestPredict)
library(dplyr)
library(tidyr)
library(ggplot2)
data(iris)
proportion_missing <- c(0, 0, 0, 0.3, 0.3)
set.seed(2022)
iris_mis <- produce_NA(iris, proportion = proportion_missing)
set.seed(2022)
missForest_object <- missForestPredict::missForest(iris_mis, verbose = TRUE, num.threads = 2)
#> Imputation sequence (missing proportion): Sepal.Length (0) Sepal.Width (0) Petal.Length (0) Petal.Width (0.3) Species (0.3)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.1227248123, 0.0910370928, 0.1806555377, 0.0450011893, 0.0339946677
#> OOB errors NMSE: 0.1801803088, 0.4824105725, 0.0583606468, 0.0781792631, 0.1531009852
#> diff. convergence measure: 0.8843598758
#> time: 0.347 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.1154506939, 0.0839680332, 0.0606952293, 0.0360575094, 0.0317198871
#> OOB errors NMSE: 0.1695007007, 0.4449512361, 0.0196075519, 0.0626416669, 0.1428561091
#> diff. convergence measure: 0.0128912362
#> time: 0.345 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.1172340237, 0.0822372559, 0.0584521511, 0.0370318757, 0.0314459001
#> OOB errors NMSE: 0.1721189235, 0.4357797519, 0.0188829271, 0.0643344053, 0.1416221605
#> diff. convergence measure: -0.0002293949
#> time: 0.552 seconds
# plot convergence
missForest_object$OOB_err %>%
  filter(!is.na(NMSE)) %>%
  ggplot(aes(iteration, NMSE, col = variable)) +
  geom_point() +
  geom_line()
We will further adapt the weights to be equal. You can observe that the results (number of iterations) will be different in this case.
set.seed(2022)
missForest_object <- missForestPredict::missForest(
  iris_mis,
  verbose = TRUE,
  # equal weight for every variable, named by column
  var_weights = setNames(rep(1, ncol(iris_mis)), colnames(iris_mis)),
  num.threads = 2
)
#> Imputation sequence (missing proportion): Sepal.Length (0) Sepal.Width (0) Petal.Length (0) Petal.Width (0.3) Species (0.3)
#> missForest iteration 1 in progress...done!
#> OOB errors MSE: 0.1227248123, 0.0910370928, 0.1806555377, 0.0450011893, 0.0339946677
#> OOB errors NMSE: 0.1801803088, 0.4824105725, 0.0583606468, 0.0781792631, 0.1531009852
#> diff. convergence measure: 0.8095536447
#> time: 0.317 seconds
#>
#> missForest iteration 2 in progress...done!
#> OOB errors MSE: 0.1154506939, 0.0839680332, 0.0606952293, 0.0360575094, 0.0317198871
#> OOB errors NMSE: 0.1695007007, 0.4449512361, 0.0196075519, 0.0626416669, 0.1428561091
#> diff. convergence measure: 0.0225349023
#> time: 0.319 seconds
#>
#> missForest iteration 3 in progress...done!
#> OOB errors MSE: 0.1172340237, 0.0822372559, 0.0584521511, 0.0370318757, 0.0314459001
#> OOB errors NMSE: 0.1721189235, 0.4357797519, 0.0188829271, 0.0643344053, 0.1416221605
#> diff. convergence measure: 0.0013638193
#> time: 0.316 seconds
#>
#> missForest iteration 4 in progress...done!
#> OOB errors MSE: 0.1161599654, 0.083527038, 0.0572972705, 0.0368173119, 0.031820625
#> OOB errors NMSE: 0.1705420284, 0.4426143784, 0.0185098437, 0.0639616498, 0.1433098001
#> diff. convergence measure: -0.0012399064
#> time: 0.322 seconds
# plot convergence
missForest_object$OOB_err %>%
  filter(!is.na(NMSE)) %>%
  ggplot(aes(iteration, NMSE, col = variable)) +
  geom_point() +
  geom_line()
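The two runs report identical per-variable errors for iterations 1 to 3 but different convergence measures, which explains why the number of iterations differs. The sketch below recomputes the measure for iteration 3 under both weightings from the printed NMSE values. As before, it assumes the measure is the weighted average NMSE of the previous iteration minus that of the current one; the helper `weighted_nmse` is hypothetical.

```r
# NMSE per variable at iterations 2 and 3, copied from the verbose output
# (order: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
nmse_iter2 <- c(0.1695007007, 0.4449512361, 0.0196075519, 0.0626416669, 0.1428561091)
nmse_iter3 <- c(0.1721189235, 0.4357797519, 0.0188829271, 0.0643344053, 0.1416221605)

# Hypothetical helper: weighted average of the per-variable NMSE
weighted_nmse <- function(nmse, w) sum(w * nmse) / sum(w)

w_default <- c(0, 0, 0, 0.3, 0.3)  # proportion of missing values per variable
w_equal   <- rep(1, 5)             # equal weights, as passed via var_weights

diff_default <- weighted_nmse(nmse_iter2, w_default) - weighted_nmse(nmse_iter3, w_default)
diff_equal   <- weighted_nmse(nmse_iter2, w_equal)   - weighted_nmse(nmse_iter3, w_equal)

print(c(default = diff_default, equal = diff_equal))
```

With the default weights the difference at iteration 3 is negative (the weighted error increased), and that is the last iteration printed for that run. With equal weights the difference is still positive because the complete variables now contribute, and the run continues to iteration 4.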
Post imputation, the error can be checked using the
evaluate_imputation_error function. This can be done, of course, only
when the true values (passed via xtrue) are known. The errors are
calculated as differences from the true values.
As Species is a categorical variable and it is imputed with one of the classes of the variable (setosa, versicolor or virginica) and not with probabilities, only the MER (misclassification error rate) can be calculated post imputation.
By default, evaluate_imputation_error returns the MSE and NMSE using
only the missing values, while the OOB error in the convergence criteria
is calculated using only the non-missing values.
evaluate_imputation_error(missForest_object$ximp, iris_mis, iris)
#>       variable        MSE       NMSE        MER  macro_F1 F1_score
#> 1 Petal.Length 0.00000000 0.00000000         NA        NA       NA
#> 2  Petal.Width 0.02809617 0.04923565         NA        NA       NA
#> 3 Sepal.Length 0.00000000 0.00000000         NA        NA       NA
#> 4  Sepal.Width 0.00000000 0.00000000         NA        NA       NA
#> 5      Species         NA         NA 0.02222222 0.9791463       NA
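To make "using only the missing values" concrete, the sketch below computes an MSE and a MER by hand on toy vectors, restricted to the entries that were missing before imputation. It is a base R illustration with made-up data and hypothetical object names. It does not call evaluate_imputation_error and is not claimed to mirror its internals.

```r
# Toy continuous variable: true values, values with NAs, and imputed values
x_true <- c(1.4, 0.2, 1.3, 2.1, 1.8, 0.4)
x_mis  <- c(1.4, NA,  1.3, NA,  1.8, NA)
x_imp  <- c(1.4, 0.3, 1.3, 2.0, 1.8, 0.6)

# Evaluate only the entries that were missing before imputation
was_missing <- is.na(x_mis)
mse_missing <- mean((x_true[was_missing] - x_imp[was_missing])^2)

# Toy categorical variable: the imputed value is a class, not a probability,
# so only a misclassification error rate (MER) can be computed
s_true <- factor(c("setosa", "versicolor", "virginica", "versicolor"))
s_mis  <- s_true
s_mis[c(2, 4)] <- NA
s_imp  <- s_true
s_imp[4] <- "virginica"  # one of the two imputed classes is wrong

s_missing   <- is.na(s_mis)
mer_missing <- mean(s_imp[s_missing] != s_true[s_missing])

print(c(MSE = mse_missing, MER = mer_missing))
```

Entries that were observed contribute nothing here, which is why these post-imputation errors are not directly comparable with the OOB errors used for convergence.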