misaem
is a package to perform linear regression and logistic regression with missing data, under MCAR (Missing completely at random) and MAR (Missing at random) mechanisms. The covariates are assumed to be continuous variables. The methodology implemented is based on maximization of the observed likelihood using EM-types of algorithms. The package includes:
norm
package allows to estimate the mean vector and a variance covariance matrix with the EM algorithm and SWEEP operator. Finally we have reshaped them to obtain the regression coefficient.library(misaem)
Let’s generate a synthetic example of classical linear regression. We first generate a design matrix of size n=50 times p=2 by drawing each observation from a multivariate normal distribution N(μ,Σ). We consider as the true values for the parameters: μ=(1,1),Σ=[1114] Then, we generate the response according to the linear regression model with coefficient β=(2,3,−1) and variance of noise vector σ2=0.25.
set.seed(1)
50 # number of rows
n <- 2 # number of explanatory variables
p <-
# Generate complete design matrix
library(MASS)
c(1, 1)
mu.X <- matrix(c(1, 1, 1, 4), nrow = 2)
Sigma.X <- mvrnorm(n, mu.X, Sigma.X)
X.complete <-
# Generate response
c(2, 3, -1) # regression coefficient
b <- 0.25 # noise variance
sigma.eps <- cbind(rep(1, n), X.complete) %*% b + rnorm(n, 0, sigma.eps) y <-
Then we randomly introduced 15% of missing values in the covariates according to the MCAR (Missing completely at random) mechanism. To do so, we use the function ampute
from the R package mice
. For more details about how to generate missing values of different mechanisms, see the resource website of missing values Rmisstastic.
library(mice)
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
# Add missing values
ampute(data.frame(y, X.complete), 0.15, patterns = matrix(
yX.miss <-c(0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0),
ncol = 3, byrow = TRUE), freq = c(1, 1, 1, 2, 2, 2) / 9,
mech = "MCAR", bycases = FALSE)
yX.miss$amp[, 1] # responses
y.obs <- as.matrix(yX.miss$amp[, 2:3]) # covariates with NAs X.obs <-
Have a look at our synthetic dataset:
head(X.obs)
## X1 X2
## [1,] 0.30528180 -0.1473747
## [2,] 1.59950261 1.2164969
## [3,] 0.22508791 -0.5764402
## [4,] 2.86148303 3.8938533
## [5,] 0.05283648 2.0009229
## [6,] -1.07586521 -0.1496864
Check the percentage of missing values:
sum(is.na(X.obs))/(n*p)
## [1] 0.17
The main function in our package to fit linear regression with missingness is miss.lm
function. The function miss.lm
mimics the structure of widely used function lm
for the case without missing values. It takes an object of class formula
(a symbolic description of the model to be fitted) and the data frame as the input. Here we apply this function with its default options.
# Estimate regression using EM with NA
data.frame(y, X.obs)
df.obs = miss.lm(y~., data = df.obs) miss.list =
## Iterations of EM:
## 1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...17...
Then it returns an object of self-defined class miss.lm
, which consists of the estimation of parameters, their standard error and observed log-likelihood. We can print or summarize the obtained results as follows:
print(miss.list)
##
## Call: miss.lm(formula = y ~ ., data = df.obs)
##
## Coefficients:
## (Intercept) X1 X2
## 1.942 3.052 -1.004
## Standard error estimates:
## (Intercept) X1 X2
## 0.04171 0.03484 0.01936
## Log-likelihood: 31.85
print(summary(miss.list))
##
## Call:
## miss.lm(formula = y ~ ., data = df.obs)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 1.94205 0.04171
## X1 3.05205 0.03484
## X2 -1.00424 0.01936
## Log-likelihood: 31.852
summary(miss.list)$coef
## Estimate Std. Error
## (Intercept) 1.942050 0.04170924
## X1 3.052050 0.03483511
## X2 -1.004244 0.01936094
Self-defined parameters can be also taken such as the maximum number of iterations (maxruns
), the convergence tolerance (tol_em
) and the logical indicating if the iterations should be reported (print_iter
).
# Estimate regression using self-defined parameters
miss.lm(y~., data = df.obs, print_iter = FALSE, maxruns = 500, tol_em = 1e-4)
miss.list2 =print(miss.list2)
##
## Call: miss.lm(formula = y ~ ., data = df.obs, print_iter = FALSE, maxruns = 500,
## tol_em = 1e-04)
##
## Coefficients:
## (Intercept) X1 X2
## 1.942 3.052 -1.004
## Standard error estimates:
## (Intercept) X1 X2
## 0.04175 0.03487 0.01938
## Log-likelihood: 31.85
The function miss.lm.model.select
adapts a BIC criterion and step-wise method to return the best model selected. We add a null variable with missing values to check if the function can distinguish it from the true variables.
# Add null variable with NA
mvrnorm(n, 1, 1)
X.null <- runif(n)<0.15 # missing completely at random
patterns <- NA
X.null[patterns] <- cbind.data.frame(X.obs, X.null)
X.obs.null <-
# Without model selection
data.frame(y, X.obs.null)
df.obs.null = miss.lm(y~., data = df.obs.null) miss.list.null =
## Iterations of EM:
## 1...2...3...4...5...6...7...8...9...10...11...12...13...14...15...16...17...18...19...20...
print(miss.list.null)
##
## Call: miss.lm(formula = y ~ ., data = df.obs.null)
##
## Coefficients:
## (Intercept) X1 X2 X.null
## 1.89134 3.05640 -1.00446 0.04426
## Standard error estimates:
## (Intercept) X1 X2 X.null
## 0.05212 0.03362 0.01853 0.02713
## Log-likelihood: 10.2
# Model selection
miss.lm.model.select(y, X.obs.null)
miss.model =print(miss.model)
##
## Call: miss.lm(formula = Y ~ ., data = df, print_iter = FALSE)
##
## Coefficients:
## (Intercept) X1 X2
## 1.942 3.052 -1.004
## Standard error estimates:
## (Intercept) X1 X2
## 0.04171 0.03484 0.01936
## Log-likelihood: 31.85
In order to evaluate the prediction performance, we generate a test set of size nt=20 times p=2 following the same distribution as the previous design matrix, and we add or not 15% of missing values.
# Prediction
# Generate dataset
set.seed(200)
20 # number of new observations
nt <- mvrnorm(nt, mu.X, Sigma.X)
Xt <-
# Add missing values
ampute(data.frame(Xt), 0.15, patterns = matrix(
Xt.miss <-c(0, 1, 1, 0),
ncol = 2, byrow = TRUE), freq = c(1, 1) /2,
mech = "MCAR", bycases = FALSE)
as.matrix(Xt.miss$amp) # covariates with NAs Xt.obs <-
The prediction can be performed for a complete test set:
#train with NA + test no NA
predict(miss.list2, data.frame(Xt), seed = 100)
miss.comptest.pred =print(miss.comptest.pred)
## [1] 3.3878210 2.6112345 -0.5562864 6.5926842 2.9231974 8.0234969
## [7] 0.8286503 3.9363413 6.7515266 3.3517064 6.8156632 2.2406832
## [13] 2.0321507 5.9852215 7.8101528 5.0863422 4.2238612 4.4541193
## [19] 3.5522691 3.0003519
And we can also apply the function when both train set and test set have missing values:
#both train & test with NA
predict(miss.list2, data.frame(Xt.obs), seed = 100)
miss.pred =print(miss.pred)
## [1] 3.3878210 3.4804264 -0.5562864 6.5926842 2.9231974 8.0234969
## [7] 0.1435715 3.9363413 6.7515266 3.3517064 6.8156632 2.2406832
## [13] 2.0321507 5.9150631 7.8101528 3.5570286 4.2238612 4.4541193
## [19] 3.5522691 3.0003519
We first generate a design matrix of size n=500 times p=5 by drawing each observation from a multivariate normal distribution N(μ,Σ). Then, we generate the response according to the logistic regression model.
We consider as the true values for the parameters β=(0,1,−1,1,0,−1),μ=(1,2,3,4,5),Σ=diag(σ)Cdiag(σ), where the σ is the vector of standard deviations σ=(1,2,3,4,5)
and C the correlation matrix C=[10.80000.810000010.30.6000.310.7000.60.71].
# Generate dataset
set.seed(200)
500 # number of subjects
n <- 5 # number of explanatory variables
p <- 1:p #rep(0,p) # mean of the explanatory variables
mu.star <- 1:p # rep(1,p) # standard deviations
sd <- matrix(c( # correlation matrix
C <-1, 0.8, 0, 0, 0,
0.8, 1, 0, 0, 0,
0, 0, 1, 0.3, 0.6,
0, 0, 0.3, 1, 0.7,
0, 0, 0.6, 0.7, 1), nrow=p)
diag(sd)%*%C%*%diag(sd) # covariance matrix
Sigma.star <- c(1, -1, 1, 1, -1) # coefficients
beta.star <- 0 # intercept
beta0.star <- c(beta0.star,beta.star)
beta.true =
# Design matrix
matrix(rnorm(n*p), nrow=n)%*%chol(Sigma.star)+
X.complete <- matrix(rep(mu.star,n), nrow=n, byrow = TRUE)
# Reponse vector
1/(1+exp(-X.complete%*%beta.star-beta0.star))
p1 <- as.numeric(runif(n)<p1) y <-
Then we randomly introduced 10% of missing values in the covariates according to the MCAR (Missing completely at random) mechanism.
# Generate missingness
set.seed(200)
0.10
p.miss <- runif(n*p)<p.miss # missing completely at random
patterns <- X.complete
X.obs <- NA X.obs[patterns] <-
Have a look at our synthetic dataset:
head(X.obs)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.0847563 1.71119812 5.0779956 9.731254821 13.02285225
## [2,] 1.2264603 0.04664033 5.3758000 6.383093558 4.84730504
## [3,] 1.4325565 1.77934455 NA 8.421927692 7.26902254
## [4,] 1.5580652 5.69782193 5.5942869 -0.440749372 -0.96662931
## [5,] 1.0597553 -0.38470918 0.4462986 0.008402997 0.04745022
## [6,] 0.8853591 0.56839374 3.4641522 7.047389616 NA
The main function for fitting logistic regression with missing covariates in our package is miss.glm
function, which mimics the structure of widely used function glm
. Note that we don’t need to specify the binomial family in the input of miss.glm
function. Here we apply this function with its default options, and then we can print or summarize the obtained results as follows:
data.frame(y, X.obs)
df.obs =
#logistic regression with NA
miss.glm(y~., data = df.obs, seed = 100) miss.list =
## Iteration of SAEM:
## 50 100 150 200
print(miss.list)
##
## Call: miss.glm(formula = y ~ ., data = df.obs, seed = 100)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## -0.03659 1.50705 -1.28208 1.12342 1.03435 -1.07691
## Standard error estimates:
## (Intercept) X1 X2 X3 X4 X5
## 0.3210 0.3446 0.2056 0.1408 0.1240 0.1284
## Log-likelihood: -171.7
print(summary(miss.list))
##
## Call:
## miss.glm(formula = y ~ ., data = df.obs, seed = 100)
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.03659 0.32104
## X1 1.50705 0.34456
## X2 -1.28208 0.20560
## X3 1.12342 0.14076
## X4 1.03435 0.12396
## X5 -1.07691 0.12843
## Log-likelihood: -171.74
summary(miss.list)$coef
## Estimate Std. Error
## (Intercept) -0.03659218 0.3210369
## X1 1.50704588 0.3445570
## X2 -1.28208040 0.2056000
## X3 1.12341764 0.1407630
## X4 1.03435057 0.1239566
## X5 -1.07690679 0.1284274
To perform model selection with missing values, we adapt criterion BIC and step-wise method. The function miss.glm.model.select
outputs the best model selected. With the current implementation, when p is greater than 20, it may encounter computational difficulties for the BIC based model selection. In the following simulation, we add a null variable with missing values to check if the function can distinguish it from the true variables.
# Add null variable with NA
mvrnorm(n, 1, 1)
X.null <- runif(n)<0.10 # missing completely at random
patterns <- NA
X.null[patterns] <- cbind.data.frame(X.obs, X.null)
X.obs.null <-
# Without model selection
data.frame(y, X.obs.null)
df.obs.null = miss.glm(y~., data = df.obs.null) miss.list.null =
## Iteration of SAEM:
## 50 100 150 200
print(miss.list.null)
##
## Call: miss.glm(formula = y ~ ., data = df.obs.null)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## -0.08280 1.52860 -1.29067 1.13314 1.05171 -1.09399
## X.null
## 0.03964
## Standard error estimates:
## (Intercept) X1 X2 X3 X4 X5
## 0.3585 0.3514 0.2084 0.1417 0.1241 0.1291
## X.null
## 0.1666
## Log-likelihood: -171.4
# model selection for SAEM
miss.glm.model.select(y, X.obs.null)
miss.model =print(miss.model)
##
## Call: miss.glm(formula = Y ~ ., data = df, print_iter = FALSE, subsets = subset_choose)
##
## Coefficients:
## (Intercept) X1 X2 X3 X4 X5
## -0.06956 1.55837 -1.30913 1.14401 1.06008 -1.10143
## Standard error estimates:
## (Intercept) X1 X2 X3 X4 X5
## 0.3244 0.3500 0.2094 0.1440 0.1279 0.1317
## Log-likelihood: -172
In order to evaluate the prediction performance, we generate a test set of size nt=100 times p=5 following the same distribution as the design matrix, and without and with 10% of missing values. We evaluate the prediction quality with a confusion matrix.
# Generate test set with missingness
set.seed(200)
100
nt = matrix(rnorm(nt*p), nrow=nt)%*%chol(Sigma.star)+
X.test <- matrix(rep(mu.star,nt), nrow = nt, byrow = TRUE)
# Generate the test set
1/(1+exp(-X.test%*%beta.star-beta0.star))
p1 <- as.numeric(runif(nt)<p1)
y.test <-
# Generate missingness on test set
0.10
p.miss <-runif(nt*p)<p.miss] <- NA
X.test[
# Prediction on test set
predict(miss.list, data.frame(X.test))
pr.saem <-
# Confusion matrix
(pr.saem>0.5)*1
pred.saem =table(y.test,pred.saem )
## pred.saem
## y.test 0 1
## 0 34 8
## 1 6 52
Logistic Regression with Missing Covariates – Parameter Estimation, Model Selection and Prediction (2020, Jiang W., Josse J., Lavielle M., TraumaBase Group), Computational Statistics & Data Analysis.