MIDASim package vignette

1 Overview

The MIDASim package (v2.0) provides a flexible simulator for microbiome sequencing data, preserving both the marginal distributions and taxon–taxon correlation structure of a user-supplied template dataset. You can optionally introduce treatment or covariate effects via model parameters. Version 2.0 further improves the simulation procedure when covariates are incorporated.

1.1 Required packages

MIDASim relies on the following CRAN packages:

These packages are all available on CRAN.

1.2 Installation

Install from the local file (the .tar.gz file), which you can download from https://github.com/mengyu-he/MIDASim.

devtools::install_local("MIDASim_2.0.tar.gz", dependencies = TRUE)

A github installation is also available via

devtools::install_github("mengyu-he/MIDASim")

1.3 Open the Vignette in R

browseVignettes("MIDASim")

or

vignette("MIDASim_vignette", package = "MIDASim")

2 MIDASim: Microbiome Data Simulator

There are three main functions in MIDASim package:

Function Purpose
MIDASim.setup() Extracts taxon‑specific parameters (mean relative abundances, marginal probability of zeros, etc.) from a template count table and stores them in a simulation object.
MIDASim.modify() Adds user‑specified effects (e.g., covariate shifts, changes in library sizes) to the stored parameters.
MIDASim() Generates realistic count tables under the current parameter configuration.

2.1 Template data description

To illustrate the workflow, we use a filtered inflammatory bowel disease (IBD) 16S dataset from the Human Microbiome Project 2 (HMP2) [1]. The processed otu table can be loaded directly:

data("count.ibd")

This HMP2 data were modified through package HMP2Data on Bioconductor by filtering out samples with library sizes smaller than 3000 and taxa that appear in less than 2 samples. The filtered data have 146 rows (samples) and 614 columns (taxa).

library(HMP2Data)
IBD16S()

count.ibd <- t(IBD16S_mtx[, colSums(IBD16S_mtx)>3000] )
count.ibd <- count.ibd[, colSums(count.ibd>0)>1]

dim(count.ibd)

Additional examples datasets that come with the package include a vaginal microbiome dataset from HMP2 [1] and a upper-respiratory-tract microbiome dataset [2].

data("count.vaginal")
data("throat.otu.tab")

2.2 Setup the MIDASim model

The function MIDASim.setup() fits the underlying MIDASim model and extracts all quantities needed to simulate new count tables from a template dataset. The function offers two modeling options for relative abundances: nonparametric and parametric. Both approaches utilize a two-step procedure in which the first step models the presence-absence relationship between taxa and the second step models the non-zero values in the microbiome community. These two approaches differ by the way how the second step marginal distributions are modeled. The nonparametric approach generates data that resemble the template data better than the parametric approach. However, the parametric approach offers modifications of the template data in more controlled way.

2.2.1 Non-parametric mode

In nonparametric mode (the default), a rank based approach is used to obtain the relative abundances for taxa that are considered to be non-zero in the first step.

To fit MIDASim with nonparametric mode to the count.ibd data, use the following:

otu.tab <- count.ibd
count.ibd.setup <- MIDASim.setup(otu.tab, 
                                 mode = 'nonparametric', 
                                 n.break.ties = 10)

The argument n.break.ties specifies the number of replicates used to break ties when ranking relative abundances for each taxon, and the default is n.break.ties = 100.

2.2.2 Parametric mode

In parametric mode, MIDASim fits a generalized gamma model to the relative abundance data of each taxon. Generalized gamma model is a three-parameter distribution in the location-scale family that was proposed for analyzing right-censored survival data.

Denote the simulated relative abundance for \(i\)-th subject and \(j\)-th taxon as \(\tilde{\pi}_{ij}\). We define ``survival time’’ \[\begin{equation} \label{eq:survivalTime} \tilde{t}_{ij} = \begin{cases} \frac{1}{\tilde{\pi}_{ij}} & \text{if } \tilde{\pi}_{ij} > 0 \\ N_i & \text{if } \tilde{\pi}_{ij} = 0 \\ \end{cases} \end{equation}\] which corresponds to treating \(\tilde{t}_{ij}\) as right-censored when \(\pi_{ij}<\frac{1}{N_i}\). The generalized gamma model then assumes \(\widetilde{t}_{ij}\) has the distribution specified by \[\begin{equation}\label{eq:logLinear} \ln(\tilde{t}_{ij}) =- \mu_j + s_j\sigma_j \cdot \omega_{ij}~, \end{equation}\] where \(e^{\omega_{ij}}\) follows a gamma distribution with shape parameter \(k_j = 1/|Q_j|\) and scale parameter 1, and \(s_j=\text{sign}(Q_j)\)}.

We estimate the three parameters \(\mu_j, \sigma_j, Q_j\) in MIDASim.setup() using the relative abundances \(\pi_j\) in the template.

To fit MIDASim in parametric mode to the count.ibd data, use the following command:

count.ibd.setup <- MIDASim.setup(otu.tab, mode = 'parametric')

The resulting list of MIDASim.setup() contains

Value Description
tetra.corr tetrachoric correlation (presence/absence)
corr.rel.corrected Pearson correlation among non‑zeros
taxa.1.prop proportion of non‑zeros per taxon
sample.1.prop proportion of non‑zeros per sample
mean.rel.abund mean relative abundance per taxon
rel.abund.1 vector of non‑zero abundances (each taxon)
mu.est, sigma.est, Q.est generalized gamma parameters (parametric mode only)

2.3 Modify the fitted MIDASim model

The function MIDASim.modify() prepares (and optionally tweaks) the parameters returned by MIDASim.setup() before data generation. Note that this is a mandatory step even if no adjustment is made. The following arguments can be modified (\(n\) = target number of samples, \(p\) = the number of taxa):

Argument What it controls Expected type
lib.size Target library sizes for each sample. If NULL, original depths are reused. vector of length \(n\)
mean.rel.abund Taxon‑specific mean relative abundances. vector of length \(p\)
gengamma.mu Location shifts \(\mu_j\) for the generalized‑gamma fit. numeric vector of length \(p\)
sample.1.prop Desired proportion of non‑zero cells per sample. numeric vector of length \(n\) or single number
taxa.1.prop Desired proportion of non‑zero cells per taxon. numeric vector of length \(p\)
individual.rel.abund Individual‑by‑taxon relative‑abundance matrix, overriding mean.rel.abund. numeric \(n\)-by-\(p\) matrix

2.3.1 No modification at all

The default is no additional adjustment is wanted. The number of samples and their target library sizes for the simulated data are the same as that of the template. The following code is an example.

count.ibd.modified <- MIDASim.modify(
  count.ibd.setup,
  lib.size               = NULL,   # keep original depths
  mean.rel.abund         = NULL,   # keep original means
  gengamma.mu            = NULL,   # keep original location parameters
  sample.1.prop          = NULL,   # keep original sparsity per sample
  taxa.1.prop            = NULL,   # keep original sparsity per taxon
  individual.rel.abund   = NULL,   # keep original individual relative abundances
)

Next, we will explain how to modify the data features to accommodate a wide range of simulation scenarios, separately for the nonparametric and parametric modes.

2.3.2 Model modification in parametric mode

When MIDASim.setup() is called with mode = "parametric", users can modify:

  • Sample size and library sizes
  • Taxon‑level location parameters (\(\mu_j\))
  • Mean or individual‑level relative abundances

Changes in the parameters of the parametric model (including library sizes) imply coordinated changes in other quantities. After either modification of the parameters, we predict the probability of being non-zero of \(i\)-th subject and \(j\)-th taxon by

\[\begin{equation} \label{gengamma_pred01} P(\widetilde{Z}_{ij} = 1) = F_j(N_i ~;~ \widehat{\mu}_{ij}, \widehat{\sigma},_j \widehat{Q}_j), \end{equation}\] to obtain the number of non-zero cells in each subject \(Z_{i\cdot }\) and that in each taxon \(Z_{\cdot j}\).

2.3.2.1 Modify the number of samples and their library sizes

To change the library sizes, set the lib.size argument equal to a vector with the target library sizes. The length of this vector becomes the new number of samples. The advantage of the parametric mode is that it handles coordinated changes automatically. As a result, when changing library sizes in parametric mode, there’s no need for the user to provide the marginal totals of non-zero cells explicitly.

For example, to generate data having library sizes that are uniformly distributed between 1,000 and 10,000 while changing the number of samples to 700, we execute the code

new.lib.size <- sample(1000:10000, size=700, replace=TRUE)

count.ibd.modified <- MIDASim.modify(
  count.ibd.setup, 
  lib.size = new.lib.size   # other arguments left NULL
  )

2.3.2.2 Modify taxa features by mean relative abundances or location parameters

Specifying new values for mean relative abundances (mean.rel.abund) or location parameters (gengamma.mu) can modify taxa features.

You can alter taxon‑level characteristics in either of two equivalent ways:

  • mean.rel.abund provide supply a new vector of target mean relative abundances. MIDASim will calculate the corresponding location parameters of the generalized‑gamma model, then updates all downstream probabilities and sparsity counts.
  • gengamma.mu - provide the mean specification and shift \(\mu_j\) directly. This is mathematically identical to changing the means on the log scale.

Because the two arguments modify the same quantity, specify one or the other, not both.

For example, if we want to increase \(log(p_j)\) by \(\beta=0.1\) for the first 10 taxa (e.g. modify these taxa according to a compositional model), the following is the code for generating a set of target relative abundances as described,

beta <- 0.1
new.mean.rel.abund <- count.ibd.setup$mean.rel.abund
new.mean.rel.abund[1:10] <- exp(beta) * new.mean.rel.abund[1:10]
new.mean.rel.abund <- new.mean.rel.abund / sum(new.mean.rel.abund)  # renormalize

then the target taxa relative abundances can be used in MIDASim.modify() as the following,

count.ibd.modified <- MIDASim.modify(
  count.ibd.setup, 
  mean.rel.abund = new.mean.rel.abund
  )

Alternatively, the users can also directly change the location parameters \(\mu_j\) by

\[\mu_j \rightarrow \mu_j + \beta \]

count.ibd.modified <- MIDASim.modify(
  count.ibd.setup, 
  gengamma.mu = count.ibd.setup$mu.est + beta
  )

2.3.2.3 Modify taxa features by individual‑specific relative abundances

MIDASim v2.0 allows you to pass an individual-by-taxon matrix of expected relative abundances (individual.rel.abund). This is the most straightforward way to encode subject‑level covariate effects (binary or continuous, one or many) in a single step.

Illustration with two covariates:

  • \(Y1\) is binary: 50 controls (\(Y=0\)) and 50 cases (\(Y=1\)). Cases get a uniform 0.1 log-fold increase in the first 10 taxa.
  • \(Y2\) is continuous: 100 draws from a standard normal. Each unit increase in \(Y2\) affects the second 10 taxa (11-20) by taxon-specific log-fold changes sampled from uniform distribution, Unif(−1, 1).
Y1 <- rep(c(0, 1), each = 50)  # binary case-control
Y2 <- rnorm(100)   # continuous
Y.all <- cbind(Y1, Y2)

n.taxa <- length(count.ibd.setup$mean.rel.abund)
beta1 <- beta2 <- rep(0, n.taxa) # effect on each taxon
beta1[1:10] <- rep(0.1, 10)
beta2[11:20] <- runif(10, -1, 1)
beta.all <- cbind(beta1, beta2)

new.individual.rel.abund <- matrix(count.ibd.setup$mean.rel.abund, 
                                   nrow = 100, ncol = n.taxa, byrow = T)
new.individual.rel.abund <- exp(Y.all %*% t(beta.all))  * new.individual.rel.abund  # introduce signals
new.individual.rel.abund <- new.individual.rel.abund/rowSums(new.individual.rel.abund)  # normalize

Feed this matrix to MIDASim.modify() as the following,

count.ibd.modified <- MIDASim.modify(
  count.ibd.setup, 
  individual.rel.abund = new.individual.rel.abund
  )

The resulting setup object now carries subject‑specific compositions that reflect the desired subject-level signals and is ready for data generation with MIDASim().

2.4 Simulate microbiome data with the fitted MIDASim model

The function MIDASim() takes the quantities obtained from MIDASim.modify() and simulates microbiome data. The output is comprised of the simulated 0/1 (presence-absence) data, sim_rel, the table of relative abundances, and sim_count, the table of count data.

simulated.data <- MIDASim(count.ibd.modified)
summary(simulated.data)

2.5 References

[1] Lloyd-Price, J., Arze, C., Ananthakrishnan, A.N. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019). https://doi.org/10.1038/s41586-019-1237-9.

[2] Charlson, E.S., Chen, J., Custers-Allen, R. et al. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PloS one, 5(12), e15216 (2010). https://doi.org/10.1371/journal.pone.0015216