dcalasso fits the adaptive lasso for large datasets using multiple linearization methods, including one-step estimation and least-squares approximation. The function can fit the adaptive lasso either when the dataset is loaded into memory as a whole (data) or when the dataset has been split a priori and saved as multiple .rds files (data.rds). The algorithm uses a divide-and-conquer one-step estimator as the initial estimator and a least-squares approximation to the partial likelihood, which reduces the computational cost. It currently supports the adaptive lasso for the Cox proportional hazards model, with or without time-dependent covariates; ties in the survival data are handled by Efron's method. The first half of the routine computes an initial (n^{1/2}-consistent) estimator: it obtains a warm start by fitting coxph to the first subset (the first random split of data, or the first file listed in data.rds) and then updates the warm start with iter.os rounds of one-step estimation, each round looping through the subsets and aggregating their score vectors and information matrices. The second half then shrinks the initial estimator using a least-squares-approximation-based adaptive lasso step.
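A minimal sketch of a call with in-memory data, assuming the dcalasso package is installed; the simulated data below is purely illustrative and not from the package:

```r
library(survival)
library(dcalasso)

## Simulate a small survival dataset (illustrative only)
set.seed(1)
n <- 5000
X1 <- rnorm(n); X2 <- rnorm(n)
time <- rexp(n, rate = exp(0.5 * X1))          # event times depend on X1 only
status <- rbinom(n, 1, 0.8)                    # ~20% censoring
dat <- data.frame(time = time, status = status, X1 = X1, X2 = X2)

## Divide-and-conquer adaptive lasso fit: K = 10 random splits,
## 2 rounds of one-step updates to the warm start
fit <- dcalasso(Surv(time, status) ~ X1 + X2, family = cox.ph(),
                data = dat, K = 10, iter.os = 2)
```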

dcalasso(formula, family = cox.ph(), data = NULL, data.rds = NULL,
  weights, subset, na.action, offset, lambda = 10^seq(-10, 3, 0.01),
  gamma = 1, K = 20, iter.os = 2, ncores = 1)

Arguments

formula

a formula specifying the model. For the Cox model, the outcome should be specified as a Surv object from the survival package: Surv(time, status), or Surv(start, stop, status) for time-dependent covariates.

family

For the Cox model, family should be cox.ph() or "cox.ph".

data

a data frame containing all variables used in the model.

data.rds

when the dataset is too large to load into RAM as a whole, one can instead supply data.rds: a character vector of full paths to randomly split subsets of the full data, each saved in .rds format.
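For illustration (the file paths and the in-memory data frame dat are hypothetical; in practice the subsets would be written by whatever process generates the data), one might pre-split the data and pass the paths via data.rds:

```r
## Randomly split a data frame 'dat' into 10 subsets and save each as .rds
K <- 10
idx <- sample(rep(seq_len(K), length.out = nrow(dat)))
paths <- file.path(tempdir(), paste0("chunk", seq_len(K), ".rds"))
for (k in seq_len(K)) saveRDS(dat[idx == k, ], paths[k])

## Fit without loading the full data into memory;
## K is overwritten to length(data.rds)
fit <- dcalasso(Surv(time, status) ~ X1 + X2, family = cox.ph(),
                data.rds = paths)
```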

weights

prior weights for each observation

subset

an expression indicating the subset of rows of data to use in model fitting

na.action

how to handle missing values (NAs)

offset

an offset term, whose coefficient is fixed at one

lambda

candidate values of the tuning parameter for the adaptive lasso penalty: penalty = lambda * sum_j |beta_j| / |beta_j,initial|^gamma

gamma

exponent of the adaptive weights in the penalty: penalty = lambda * sum_j |beta_j| / |beta_j,initial|^gamma

K

number of divisions (subsets) of the full dataset. It is overwritten to length(data.rds) if data.rds is given.

iter.os

number of rounds of one-step updates applied to the warm start

ncores

number of cores to use. If ncores > 1, the loop over subsets is parallelized using foreach.

Value

coefficients.pen

coefficient estimates from the adaptive lasso shrinkage step

coefficients.unpen

initial unregularized estimator

cov.unpen

variance-covariance matrix of the unpenalized model

cov.pen

variance-covariance matrix of the penalized model

BIC

sequence of BIC values, one for each lambda

n.pen

sample size used to penalize the degrees of freedom in the BIC

n

number of rows of the data used in the fit

idx.opt

index of the lambda value achieving the minimal BIC

BIC.opt

minimal BIC

family

family object of the model

lambda.opt

value of lambda that minimizes the BIC

df

degrees of freedom at each lambda

p

number of covariates

iter

number of one-step iterations performed

Terms

the terms object of the model
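A sketch of inspecting a fitted object, using element names taken from the Value section above (fit is assumed to come from a prior dcalasso call):

```r
## Penalized (adaptive lasso) and unpenalized coefficient estimates
fit$coefficients.pen
fit$coefficients.unpen

## BIC at the selected tuning parameter
fit$BIC.opt

## Approximate standard errors of the unpenalized estimator
sqrt(diag(fit$cov.unpen))
```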

References

Wang, Yan, Chuan Hong, Nathan Palmer, Qian Di, Joel Schwartz, Isaac Kohane, and Tianxi Cai. "A Fast Divide-and-Conquer Sparse Cox Regression." arXiv preprint arXiv:1804.00735 (2018).

Author

Yan Wang <yaw719@mail.harvard.edu>, Tianxi Cai <tcai@hsph.harvard.edu>, Chuan Hong <Chuan_Hong@hms.harvard.edu>