| Title: | Power and Sample Size Calculation for Differential Abundance Microbiome Studies |
|---|---|
| Description: | Provides functions for estimating statistical power and required sample sizes in differential abundance microbiome studies using negative binomial models. The methods are based on Agronah and Bolker (2025) <doi:10.1371/journal.pone.0318820>. The package includes tools for simulation-based power analysis and sample size estimation using generalized additive models (GAMs), and visualization utilities for exploring the relationship between power, effect size, abundance, and sample size. |
| Authors: | Michael Agronah [aut, cre] (ORCID: <https://orcid.org/0009-0007-2655-4094>), Ben Bolker [aut] (ORCID: <https://orcid.org/0000-0002-2127-0443>) |
| Maintainer: | Michael Agronah <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-06-16 09:52:20 UTC |
| Source: | https://github.com/magronah/power.nb |
Contour plot for showing predicted power
contour_plot_fun( combined_data, power_estimate, cont_breaks, multiple_samples = FALSE )contour_plot_fun( combined_data, power_estimate, cont_breaks, multiple_samples = FALSE )
combined_data |
data used for fitting gam |
power_estimate |
predicted power |
cont_breaks |
breaks for contour plot |
multiple_samples |
not currently used |
ggplot2 object
This function simulates count data for microbiome studies based on log mean, log fold change, and dispersion parameters. It supports generating data for multiple simulations and allows flexibility in specifying the number of control and treatment samples or samples per group.
countdata_sim_fun( logmean_param, logfoldchange_param, dispersion_param, nsamp_per_group = NULL, ncont = NULL, ntreat = NULL, notu, nsim = 1, disp_scale = 0.3, max_lfc = 15, maxlfc_iter = 1000, seed = NULL )countdata_sim_fun( logmean_param, logfoldchange_param, dispersion_param, nsamp_per_group = NULL, ncont = NULL, ntreat = NULL, notu, nsim = 1, disp_scale = 0.3, max_lfc = 15, maxlfc_iter = 1000, seed = NULL )
logmean_param |
A list of parameters for simulating the log mean abundance. |
logfoldchange_param |
A list of parameters for simulating log fold change, containing:
|
dispersion_param |
A list of dispersion parameters containing:
|
nsamp_per_group |
Number of samples per group (control and treatment). If provided,
|
ncont |
Number of control samples. Specify along with |
ntreat |
Number of treatment samples. Specify along with |
notu |
Number of operational taxonomic units (OTUs) to simulate. |
nsim |
Number of simulations to run. Default is 1. |
disp_scale |
Scale parameter for the dispersion. Default is 0.3. |
max_lfc |
Maximum allowable log fold change. Default is 15. |
maxlfc_iter |
Maximum number of iterations for ensuring log fold change is within |
seed |
Seed value for reproducibility. Default is |
A list containing:
countdata_list: A list of count data matrices for each simulation.
metadata_list: A list of metadata data frames for each simulation.
logmean_list: A list of log mean vectors for each simulation.
logfoldchange_list: A list of log fold change vectors for each simulation.
treat_countdata_list: A list of treatment count data matrices for each simulation.
control_countdata_list: A list of control count data matrices for each simulation.
# Load required packages library(foreach) library(doParallel) # Define parameters logmean_param <- list(mu = 0, sigma = 1) logfoldchange_param <- list(par = rnorm(11), np = 2, sd_ord = 2) dispersion_param <- list(asymptDisp = 0.1, extraPois = 0.05) # Simulate count data result <- countdata_sim_fun( logmean_param = logmean_param, logfoldchange_param = logfoldchange_param, dispersion_param = dispersion_param, nsamp_per_group = 10, notu = 50, nsim = 2, seed = 123 ) # Access simulation results countdata <- result$countdata_list[[1]] metadata <- result$metadata_list[[1]]# Load required packages library(foreach) library(doParallel) # Define parameters logmean_param <- list(mu = 0, sigma = 1) logfoldchange_param <- list(par = rnorm(11), np = 2, sd_ord = 2) dispersion_param <- list(asymptDisp = 0.1, extraPois = 0.05) # Simulate count data result <- countdata_sim_fun( logmean_param = logmean_param, logfoldchange_param = logfoldchange_param, dispersion_param = dispersion_param, nsamp_per_group = 10, notu = 50, nsim = 2, seed = 123 ) # Access simulation results countdata <- result$countdata_list[[1]] metadata <- result$metadata_list[[1]]
Fold change and p-value estimations for simulations
deseq_fun_est( metadata_list, countdata_list, alpha_level = 0.1, group_colname, sample_colname, num_cores = 1, ref_name = NULL )deseq_fun_est( metadata_list, countdata_list, alpha_level = 0.1, group_colname, sample_colname, num_cores = 1, ref_name = NULL )
metadata_list |
: list of metadata |
countdata_list |
: list of otu count data |
alpha_level |
The significance level for determining differential expression. Default is 0.1. |
group_colname |
column names of the groups or conditions |
sample_colname |
column names of the samples |
num_cores |
: number of cores |
ref_name |
reference level for fold change calculation. If NULL, the reference level is determined automatically — by default, the factor level that comes first is used as the reference. |
A list logfoldchange log fold change estimates
logmean is the log mean count for taxa (arithmetic mean for taxa across all subjects)
dispersion: dispersion estimates for each taxa
deseq_estimate is a dataframe containing results from deseq baseMean,log2FoldChange, lfcSE, pvalue, padj
normalised_count is the normalised count data
DESeq2.This function estimates log fold changes (LFC) for microbiome count data using the DESeq2 package. It is strongly recommended to keep the following defaults:
minReplicatesForReplace=Inf
cooksCutoff=TRUE
independentFiltering=TRUE
These options are particularly useful when estimating fold changes to fit the mixture of Gaussian distributions.
deseqfun( countdata, metadata, alpha_level = 0.1, ref_name = NULL, group_colname, sample_colname, minReplicatesForReplace = Inf, cooksCutoff = TRUE, independentFiltering = TRUE, shrinkage_method = "normal" )deseqfun( countdata, metadata, alpha_level = 0.1, ref_name = NULL, group_colname, sample_colname, minReplicatesForReplace = Inf, cooksCutoff = TRUE, independentFiltering = TRUE, shrinkage_method = "normal" )
countdata |
A matrix of OTU count data where rows represent taxa and columns represent samples. |
metadata |
A dataframe containing sample information with two rows: one for sample names and one for group names. |
alpha_level |
The significance level for determining differential expression. Default is 0.1. |
ref_name |
reference level for fold change calculation. If NULL, the reference level is determined automatically — by default, the factor level that comes first is used as the reference. |
group_colname |
column names of the groups or conditions |
sample_colname |
column names of the samples |
minReplicatesForReplace |
DESeq2's parameter to control the minimum number of replicates required for replacing outliers during dispersion estimation. Default is |
cooksCutoff |
DESeq2's parameter for removing outliers based on Cook's distance. Default is |
independentFiltering |
DESeq2's parameter for independent filtering. Default is |
shrinkage_method |
DESeq2's shrinkage method for fold changes. Default is |
A list containing the following elements:
logfoldchange: A vector of log fold change estimates for each taxa.
dispersion: A vector of dispersion estimates for each taxa.
deseq_estimate: A dataframe containing DESeq2 results, including baseMean, log2FoldChange, lfcSE, pvalue, and padj.
normalised_count: A matrix of normalized count data.
dds: DESeq object
# Example usage set.seed(101) nr = 10; nc = 35 countdata <- matrix(rpois(350, 3), ncol = nc, nrow = nr) # Simulated OTU count data with 50 taxa and 10 samples countdata <- as.data.frame(countdata) rownames(countdata) <- paste("Sample", 1:nr, sep = "_") colnames(countdata) <- paste("otu", 1:nc, sep = "_") metadata <- data.frame(Samples = paste("Sample", 1:nr, sep = "_"), Groups = rep(c("Control", "Treatment"), each = 5)) sample_colname = "Samples" group_colname = "Groups" result <- deseqfun(countdata, metadata, ref_name = "Control", minReplicatesForReplace = Inf, cooksCutoff = TRUE, sample_colname = "Samples", group_colname = "Groups", independentFiltering = TRUE, shrinkage_method="normal") # Examine the results result$logfoldchange # Log fold changes result$logmean # Log mean counts result$dispersion # Dispersion estimates result$deseq_estimate # DESeq2 results result$normalised_count # Normalized count data# Example usage set.seed(101) nr = 10; nc = 35 countdata <- matrix(rpois(350, 3), ncol = nc, nrow = nr) # Simulated OTU count data with 50 taxa and 10 samples countdata <- as.data.frame(countdata) rownames(countdata) <- paste("Sample", 1:nr, sep = "_") colnames(countdata) <- paste("otu", 1:nc, sep = "_") metadata <- data.frame(Samples = paste("Sample", 1:nr, sep = "_"), Groups = rep(c("Control", "Treatment"), each = 5)) sample_colname = "Samples" group_colname = "Groups" result <- deseqfun(countdata, metadata, ref_name = "Control", minReplicatesForReplace = Inf, cooksCutoff = TRUE, sample_colname = "Samples", group_colname = "Groups", independentFiltering = TRUE, shrinkage_method="normal") # Examine the results result$logfoldchange # Log fold changes result$logmean # Log mean counts result$dispersion # Dispersion estimates result$deseq_estimate # DESeq2 results result$normalised_count # Normalized count data
Dispersion are estimated from the DESeq2 package. The function fitted is of the
form a + b/(mean count) where a represents the asymptotic dispersion level
for high abundance taxa, and b captures additional dispersion variability.
dispersion_fit(dispersion, logmean)dispersion_fit(dispersion, logmean)
dispersion |
dispersion estimates from deseq |
logmean |
vector of log mean abundance |
A list containing estimates for a and b and confidence intervals
logmean = rnorm(100) dispersion = abs(rnorm(100)) dispersion_fit(dispersion,logmean)logmean = rnorm(100) dispersion = abs(rnorm(100)) dispersion_fit(dispersion,logmean)
This function calculates the dispersion value for microbiome data based on the provided parameters: mean abundance, asymptotic dispersion, and extra Poisson dispersion.
dispersion_fun(mean_abund, asymptDisp, extraPois)dispersion_fun(mean_abund, asymptDisp, extraPois)
mean_abund |
Numeric value representing the mean abundance of the taxa. |
asymptDisp |
Numeric value for the asymptotic dispersion (the dispersion at high abundance). |
extraPois |
Numeric value for the extra Poisson dispersion (to model overdispersion). |
The dispersion is calculated using the formula:
A numeric value representing the dispersion.
mean_abund <- 10 asymptDisp <- 0.1 extraPois <- 0.05 dispersion_fun(mean_abund, asymptDisp, extraPois)mean_abund <- 10 asymptDisp <- 0.1 extraPois <- 0.05 dispersion_fun(mean_abund, asymptDisp, extraPois)
This function calculates the density of a normal mixture model for a given vector of parameters on the unconstrained scale (softmax(prob), mean, log(sd)).
dnormmix(x, par, logmean, ..., log = FALSE)dnormmix(x, par, logmean, ..., log = FALSE)
x |
Numeric vector of values at which to evaluate the density. |
par |
A vector of parameters on the unconstrained scale, including:
|
logmean |
Numeric value representing the log of the mean parameter. |
... |
Additional arguments passed to the |
log |
Logical. If |
A numeric vector of density values (or log-density values if log = TRUE) for the mixture model.
# Example parameters x <- seq(-3, 3, length.out = 100) ## par <- c(-0.5, 0.5, log(0.8), log(1.2)) # Example: softmax probabilities, mean, log(sd) set.seed(101); par <- rnorm(11) logmean <- rep(0.1, length(x)) ## constant log mean # Calculate density density <- dnormmix(x, par, logmean) # Calculate log-density log_density <- dnormmix(x, par, logmean, log = TRUE)# Example parameters x <- seq(-3, 3, length.out = 100) ## par <- c(-0.5, 0.5, log(0.8), log(1.2)) # Example: softmax probabilities, mean, log(sd) set.seed(101); par <- rnorm(11) logmean <- rep(0.1, length(x)) ## constant log mean # Calculate density density <- dnormmix(x, par, logmean) # Calculate log-density log_density <- dnormmix(x, par, logmean, log = TRUE)
takes pars as three vectors (prob, mean, sd), on constrained scale
dnormmix0(x, probs, muvals, sdvals, log = FALSE)dnormmix0(x, probs, muvals, sdvals, log = FALSE)
x |
vector |
probs |
mixture proportions |
muvals |
values of the mean |
sdvals |
values for standard deviataions |
log |
log scale |
likelihood
Filter to retain only taxa with at least abund_thresh counts in at least sample_thresh samples
filter_low_count( countdata, metadata, abund_thresh = 5, sample_thresh = 3, sample_colname, group_colname )filter_low_count( countdata, metadata, abund_thresh = 5, sample_thresh = 3, sample_colname, group_colname )
countdata |
otu table |
metadata |
dataframe with 2 rows sample names and group names |
abund_thresh |
minimum number of taxa abundance threshold |
sample_thresh |
minimum number of sample threshold |
sample_colname |
column names of the samples |
group_colname |
column names of the groups or conditions |
filtered otu count data
Title
gam_fit( deseq_est_list, true_lfoldchange_list, true_lmean_list, grid_len = 50, alpha_level = 0.1 )gam_fit( deseq_est_list, true_lfoldchange_list, true_lmean_list, grid_len = 50, alpha_level = 0.1 )
deseq_est_list |
a list containing fold change, pvalues and other estimates from |
true_lfoldchange_list |
list containing simulated log fold change used for simulating the count data |
true_lmean_list |
list containing simulated log mean count used for simulating the count data |
grid_len |
number of grids for |
alpha_level |
significance level for power calculations |
A list fit_2d is the fitted scam object
power_estimate predicted power estimated using fit_2d
combined_data tibble containing pvlaues and used for GAM fit
This function generates parameter names for a Gaussian mixture model based on the number
of components (np) and the order of the polynomial function (sd_ord)
used to model the standard deviation parameters.
gen_parnames(np, sd_ord)gen_parnames(np, sd_ord)
np |
Integer. The number of Gaussian components in the mixture model. |
sd_ord |
Integer. The order of the polynomial function used to model the standard deviation parameters. Possible values are:
|
A character vector of parameter names, including:
Logit-transformed probabilities (logitprob_1, ..., logitprob_(np-1)).
Mean parameters (mu_int_1, mu_slope_1, ..., for each component).
Log-transformed standard deviations (logsd_.1_1, logsd_.L_1, ...,
depending on sd_ord and np).
# Generate parameter names for a 3-component mixture with linear standard deviation function gen_parnames(np = 3, sd_ord = 1) # Generate parameter names for a 4-component mixture with quadratic standard deviation function gen_parnames(np = 4, sd_ord = 2)# Generate parameter names for a 3-component mixture with linear standard deviation function gen_parnames(np = 3, sd_ord = 1) # Generate parameter names for a 4-component mixture with quadratic standard deviation function gen_parnames(np = 4, sd_ord = 2)
generate normal mixture parameters (prob vector, mean vector, sd vector for a specified set of 'x' values (logmean)
genmixpars(x, pars, np = 2, sd_ord = 2)genmixpars(x, pars, np = 2, sd_ord = 2)
x |
independent variable |
pars |
parameter vector: first logit-probs (np-1), then mean parameters (2 per component: intercepts, slopes), then var parameters (varord + 1 per component: intercepts, slopes, quad coeffs, etc.) |
np |
number of components in mixture |
sd_ord |
order of logsd model (2 = quadratic) |
A list
probs: mixture proportions () muvals: mean values sdvals: standard deviation values
The standard deviation parameters are modeled either by linear or quadratic functions of log mean count and the mean parameter is modeled by linear functions of log mean count
logfoldchange_fit( logmean, logfoldchange, ncore = 2, max_sd_ord = 2, max_np = 5, minval = -5, maxval = 5, itermax = 100, NP = 800, seed = 100 )logfoldchange_fit( logmean, logfoldchange, ncore = 2, max_sd_ord = 2, max_np = 5, minval = -5, maxval = 5, itermax = 100, NP = 800, seed = 100 )
logmean |
vector of log mean abundance |
logfoldchange |
vector of log fold change |
ncore |
number of cores to use |
max_sd_ord |
the maximum order of polynomial function to fit to standard deviation parameter. This must be either 1 (linear) or 2(quadratic) |
max_np |
maximum number of Gaussian components to check for |
minval |
minimum value for DEoptim search |
maxval |
maximum value for DEoptim search |
itermax |
maximum number of iterations |
NP |
the number of population members for DEoptim |
seed |
seed value |
A list.
par is a vector of the estimates of the mixture proportion, the mean and standard deviation parameters, np is the number of gaussian components fitted sd_ord is the order for the function for the standard deviation aic is the aic of the best fit
logmean = rnorm(100) logfoldchange = rnorm(100) logfoldchange_fit(logmean,logfoldchange)logmean = rnorm(100) logfoldchange = rnorm(100) logfoldchange_fit(logmean,logfoldchange)
This function generates simulated log fold change (LFC) values based on the provided log mean abundance and LFC parameters. The simulation ensures that the generated LFC values remain within a specified maximum range by iterating until convergence or until a maximum iteration limit is reached.
logfoldchange_sim_fun( logmean_sim, logfoldchange_param, max_lfc = 15, max_iter = 10000, seed = 121 )logfoldchange_sim_fun( logmean_sim, logfoldchange_param, max_lfc = 15, max_iter = 10000, seed = 121 )
logmean_sim |
A numeric vector of simulated log mean abundances. |
logfoldchange_param |
A list containing parameters for the log fold change simulation:
|
max_lfc |
A numeric value specifying the maximum allowable absolute log fold change value. Default is 15. |
max_iter |
An integer specifying the maximum number of iterations allowed to ensure all simulated LFC values are within the |
seed |
random-number seed |
A numeric vector of simulated log fold change values (lfc).
set.seed(101) # Define simulated log mean abundance logmean_sim <- rnorm(100, mean = 0, sd = 1) # Define parameters for log fold change simulation logfoldchange_param <- list( par = rnorm(11), # Example parameters np = 2, # Number of components sd_ord = 2 # Order of polynomial for SD ) # Simulate log fold change values logfoldchange_sim_fun( logmean_sim = logmean_sim, logfoldchange_param = logfoldchange_param, max_lfc = 10, max_iter = 5000 )set.seed(101) # Define simulated log mean abundance logmean_sim <- rnorm(100, mean = 0, sd = 1) # Define parameters for log fold change simulation logfoldchange_param <- list( par = rnorm(11), # Example parameters np = 2, # Number of components sd_ord = 2 # Order of polynomial for SD ) # Simulate log fold change values logfoldchange_sim_fun( logmean_sim = logmean_sim, logfoldchange_param = logfoldchange_param, max_lfc = 10, max_iter = 5000 )
The optimal number of components to fit is chosen using parametric bootstrap method
logmean_fit(logmean, sig = 0.05, max.comp = 4, max.boot = 100, verb = FALSE)logmean_fit(logmean, sig = 0.05, max.comp = 4, max.boot = 100, verb = FALSE)
logmean |
vector of log mean count of taxa |
sig |
significance level to compare against p-value to be used for parametric bootstrap calculation |
max.comp |
maximum number of Gaussian components to compare sequentially |
max.boot |
maximum number of bootstraps simulations |
verb |
If TRUE, it prints out updates of iterations of the algorithm |
A list containing the optimal number of Gaussian components fitted; and mean and variance parameter estimates from the fit
logmean = rnorm(100) logmean_fit(logmean,sig=0.05,max.comp=4,max.boot=100)logmean = rnorm(100) logmean_fit(logmean,sig=0.05,max.comp=4,max.boot=100)
This function generates log means for a specified number of OTUs (Operational Taxonomic Units) based on the provided parameters. If a single mean is specified, the log means are drawn from a normal distribution. If multiple means and corresponding weights are specified, the log means are drawn from a mixture of normal distributions.
logmean_sim_fun(logmean_param, notu)logmean_sim_fun(logmean_param, notu)
logmean_param |
A list containing the parameters for the distribution:
|
notu |
An integer specifying the number of OTUs to simulate. |
A numeric vector of simulated log means for the specified number of OTUs.
# Example 1: Single normal distribution params_single <- list(mu = 0, sigma = 1) logmean_sim_fun(logmean_param = params_single, notu = 100) # Example 2: Mixture of normal distributions params_mixture <- list( mu = c(-1, 1), sigma = c(0.5, 0.5), lambda = c(0.4, 0.6) ) logmean_sim_fun(logmean_param = params_mixture, notu = 100)# Example 1: Single normal distribution params_single <- list(mu = 0, sigma = 1) logmean_sim_fun(logmean_param = params_single, notu = 100) # Example 2: Mixture of normal distributions params_mixture <- list( mu = c(-1, 1), sigma = c(0.5, 0.5), lambda = c(0.4, 0.6) ) logmean_sim_fun(logmean_param = params_mixture, notu = 100)
Simulating from a mixture of Gaussian
myrnormmix(par, logmean, ...)myrnormmix(par, logmean, ...)
par |
parameters (mean, standard deviation and mixture proportion of the mixture of Gaussian) |
logmean |
log mean count of taxa |
... |
other parameters taken by |
random values from a a mixture of Gaussian
Objective function
nllfun(par, vals, logmean, np, sd_ord)nllfun(par, vals, logmean, np, sd_ord)
par |
parameters of the mixture of Gaussian |
vals |
values of log fold change |
logmean |
log mean count |
np |
number of Gaussian components |
sd_ord |
order of polynomial function to model standard deviation parameters (1 - linear function and 2- quad) |
value for the objective function
The number of gaussian components is determined using the using parametric bootstrap
optimal.comp(logmean, sig = 0.05, max.comp = 4, max.boot = 100)optimal.comp(logmean, sig = 0.05, max.comp = 4, max.boot = 100)
logmean |
vector of log mean abundances of taxa |
sig |
significance level to compare against p-value |
max.comp |
maximum number of Gaussian components to compare sequentially |
max.boot |
maximum number of bootstraps simulations |
The best number of components for fitting the distribution of log mean abundance
logmean = rnorm(100) optimal.comp(logmean,sig=0.05,max.comp=4,max.boot=100)logmean = rnorm(100) optimal.comp(logmean,sig=0.05,max.comp=4,max.boot=100)
General-purpose log-likelihood function, vectorized sum(pars*x^i)
polyfun(pars, x)polyfun(pars, x)
pars |
parameters |
x |
log mean count |
values representing output for the polynomial fuction (f(x))
polyfun(pars = c(1, 2, 3), x = 1:5) polyfun(pars = c(1, 0, 3), x = 1)polyfun(pars = c(1, 2, 3), x = 1:5) polyfun(pars = c(1, 0, 3), x = 1)
Fits a shape-constrained additive model (SCAM) to estimate statistical power as a function of mean abundance, absolute log fold change, and sample size. The function takes p-values obtained across simulation settings, converts them to rejection indicators based on a chosen significance level, and then models the probability of rejection using smooth terms.
power_fun_ss( pval_est_list, logmean_list, nsample_vec, logfoldchange_list, alpha_level = 0.1 )power_fun_ss( pval_est_list, logmean_list, nsample_vec, logfoldchange_list, alpha_level = 0.1 )
pval_est_list |
A list of p-value vectors obtained from estimating the fold change estimates from the simulated datasets across different sample sizes. Each p-value corresponds to a test result for a particular combination of mean abundance, log fold change, and sample size. |
logmean_list |
A list containing the simulated log mean abundance values
corresponding to the p-values in |
nsample_vec |
A numeric vector of sample sizes used in the simulations. |
logfoldchange_list |
A list containing the simulated log fold change values
corresponding to the p-values in |
alpha_level |
Numeric significance level used to define rejection of the
null hypothesis. Default is |
The returned model can be used as a smooth approximation to the empirical power surface, which is useful for interpolation and sample size prediction.
The function first pools all p-values from pval_est_list and converts them
into binary rejection indicators:
where alpha_level is the chosen significance threshold.
A data frame is then constructed with:
logmean: log mean abundance,
abs_lfc: absolute log fold change,
pval_reject: binary rejection indicator,
sample_size: sample size,
logsample_size: base-2 logarithm of sample size.
The function attempts to fit the following SCAM model:
using a binomial family.
If the full model fails due to numerical instability, a simpler fallback model is fitted instead:
A warning is issued when the fallback model is used.
A list with two components:
combined_data: A tibble containing the combined simulation inputs
and rejection indicators used for model fitting.
gam_mod: The fitted scam model object.
#' @examples # Example structure only set.seed(101) n = 70 pval_est_list = list(rnorm(n),rnorm(n)) logmean_list = list(rnorm(n),rnorm(n)) logfoldchange_list = list(rnorm(n),rnorm(n)) nsample_vec <- c(20, 40) out <- power_fun_ss( pval_est_list = pval_est_list, logmean_list = logmean_list, nsample_vec = nsample_vec, logfoldchange_list = logfoldchange_list, alpha_level = 0.1 ) out$combined_data out$gam_mod#' @examples # Example structure only set.seed(101) n = 70 pval_est_list = list(rnorm(n),rnorm(n)) logmean_list = list(rnorm(n),rnorm(n)) logfoldchange_list = list(rnorm(n),rnorm(n)) nsample_vec <- c(20, 40) out <- power_fun_ss( pval_est_list = pval_est_list, logmean_list = logmean_list, nsample_vec = nsample_vec, logfoldchange_list = logfoldchange_list, alpha_level = 0.1 ) out$combined_data out$gam_mod
This function extracts a specific component (data) from a list of datasets.
The component to extract is specified by the extract_name parameter, and
the function returns a list containing the extracted data from each dataset.
read_data(dataset_list, extract_name)read_data(dataset_list, extract_name)
dataset_list |
A list of datasets from which data will be extracted. Each element of the list is assumed to be a dataset (typically a list or dataframe). |
extract_name |
A string representing the name of the component or column
to be extracted from each dataset in the |
A list containing the extracted data. Each element corresponds to
the extracted component from the datasets in dataset_list. The names of
the list elements are taken from the names of dataset_list.
# Example dataset list dataset1 <- list(countdata = matrix(1:9, nrow = 3), metadata = data.frame(id = 1:3)) dataset2 <- list(countdata = matrix(10:18, nrow = 3), metadata = data.frame(id = 4:6)) dataset_list <- list(dataset1 = dataset1, dataset2 = dataset2) # Extract 'countdata' from each dataset in the list result <- read_data(dataset_list, "countdata") print(result)# Example dataset list dataset1 <- list(countdata = matrix(1:9, nrow = 3), metadata = data.frame(id = 1:3)) dataset2 <- list(countdata = matrix(10:18, nrow = 3), metadata = data.frame(id = 4:6)) dataset_list <- list(dataset1 = dataset1, dataset2 = dataset2) # Extract 'countdata' from each dataset in the list result <- read_data(dataset_list, "countdata") print(result)
general-purpose normal-mixture deviate generator: takes matrices of probabilities, means, sds
rnormmix0(n, probs, muvals, sdvals)rnormmix0(n, probs, muvals, sdvals)
n |
number of observations |
probs |
mixture proportions |
muvals |
values for the mean |
sdvals |
values for the standard deviation |
simulations
This function estimates the sample size required to achieve a specified target power using predictions from a fitted GAM/SCAM power model. The function evaluates predicted power across a grid of candidate sample sizes and identifies the smallest sample size for which the predicted power reaches or exceeds the target value. Linear interpolation is then used on the log2(sample size) scale to obtain a more precise estimate.
sample_size_ss_interp( target_power, logmean, abs_lfc, model, xmin = log2(5), xmax = log2(500), ngrid = 1000 )sample_size_ss_interp( target_power, logmean, abs_lfc, model, xmin = log2(5), xmax = log2(500), ngrid = 1000 )
target_power |
Numeric value specifying the desired statistical power. |
logmean |
Numeric value representing the log mean abundance. |
abs_lfc |
Numeric value representing the absolute log fold change. |
model |
A fitted GAM/SCAM model used to predict statistical power. |
xmin |
Numeric value giving the minimum log2(sample size) considered
in the search. Default is |
xmax |
Numeric value giving the maximum log2(sample size) considered
in the search. Default is |
ngrid |
Integer specifying the number of grid points used when
searching for the sample size solution. Default is |
The function first constructs a grid of candidate sample sizes on the
log2 scale between xmin and xmax. Predicted power values are
then obtained from the fitted model for each grid point. The smallest
sample size at which the predicted power reaches the target value is
identified, and linear interpolation is used to refine the estimate.
A numeric value representing the estimated sample size required
to achieve the target power. Returns NA if the target power is
not reached within the specified search range.
This function estimates the sample size required to achieve a specified target statistical power using a fitted GAM/SCAM power model. The function first attempts to solve for the sample size using a root-finding algorithm. If the root-finding procedure fails, a grid-based interpolation method is used as a fallback to obtain an approximate solution.
ss_solver( target_power, logmean, abs_lfc, model, xmin = log2(5), xmax = log2(500) )ss_solver( target_power, logmean, abs_lfc, model, xmin = log2(5), xmax = log2(500) )
target_power |
Numeric value specifying the desired statistical power. |
logmean |
Numeric value representing the log mean abundance. |
abs_lfc |
Numeric value representing the absolute log fold change. |
model |
A fitted GAM/SCAM model used to predict statistical power. |
xmin |
Numeric value giving the minimum log2(sample size) considered
in the search. Default is |
xmax |
Numeric value giving the maximum log2(sample size) considered
in the search. Default is |
The function first attempts to compute the required sample size using a
root-finding approach implemented in uniroot_ss(). If this method
fails (for example, due to numerical issues or if the root cannot be
bracketed within the specified interval), the function falls back to
a grid-based interpolation approach implemented in
sample_size_ss_interp().
A warning is issued if the target power is specified as 0 or 1, since these values correspond to unrealistic design targets in most practical applications.
A numeric value representing the estimated sample size required to achieve the target power.
Sample Size estimation function using uniroot
uniroot_ss( target_power, logmean, abs_lfc, model, xmin = log2(10), xmax = log2(5000), maxiter = 10000, max_report = 2000 )uniroot_ss( target_power, logmean, abs_lfc, model, xmin = log2(10), xmax = log2(5000), maxiter = 10000, max_report = 2000 )
target_power |
Numeric value specifying the desired statistical power. |
logmean |
Numeric value representing the log of the mean abundance. |
abs_lfc |
Numeric value representing the absolute log fold change. |
model |
A fitted GAM/SCAM model used to predict statistical power. |
xmin |
Numeric value giving the minimum sample size considered in the search. |
xmax |
Numeric value giving the maximum sample size considered in the search. |
maxiter |
maximum number of iterations |
max_report |
maximum group sample size to be predicted. Any predicted sample size that exceed max_report will be reported as "> max_report" |
A numeric value corresponding to the estimated sample size required to achieve the target power.