dcalasso fits the adaptive lasso for large datasets using multiple linearization methods, including one-step estimation and least-squares approximation. The function can fit the adaptive lasso either when the dataset is loaded into memory as a whole (data) or when the dataset has been split a priori and saved as multiple .rds files (data.rds). The algorithm uses a divide-and-conquer one-step estimator as the initial estimator and a least-squares approximation to the partial likelihood, which reduces the computational cost. It currently supports the adaptive lasso for the Cox proportional hazards model, with or without time-dependent covariates; ties in the survival data are handled by Efron's method. The first half of the routine computes an initial (n^{1/2}-consistent) estimator: it obtains a warm start by fitting coxph to the first subset (the first random split of data, or the first file listed in data.rds) and then updates the warm start with iter.os rounds of one-step estimation, each round looping through the subsets and aggregating their score vectors and information matrices. The second half then shrinks the initial estimator using a least-squares-approximation-based adaptive lasso step.
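A minimal sketch of a call with in-memory data, assuming the dcalasso package is installed; the simulated data below is purely illustrative and not from the package:

```r
library(survival)
library(dcalasso)

## Simulate a small survival dataset (illustrative only)
set.seed(1)
n <- 5000
X1 <- rnorm(n); X2 <- rnorm(n)
time <- rexp(n, rate = exp(0.5 * X1))          # event times depend on X1 only
status <- rbinom(n, 1, 0.8)                    # ~20% censoring
dat <- data.frame(time = time, status = status, X1 = X1, X2 = X2)

## Divide-and-conquer adaptive lasso fit: K = 10 random splits,
## 2 rounds of one-step updates to the warm start
fit <- dcalasso(Surv(time, status) ~ X1 + X2, family = cox.ph(),
                data = dat, K = 10, iter.os = 2)
```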

dcalasso(formula, family = cox.ph(), data = NULL, data.rds = NULL,
  weights, subset, na.action, offset, lambda = 10^seq(-10, 3, 0.01),
  gamma = 1, K = 20, iter.os = 2, ncores = 1)

Arguments

formula

a formula specifying the model. For the Cox model, the outcome should be specified as a Surv object from the survival package: Surv(time, status), or Surv(start, stop, status) for time-dependent covariates.

family

For the Cox model, family should be cox.ph() or "cox.ph".

data

a data frame containing all variables used in the model.

data.rds

when the dataset is too large to load into RAM as a whole, one can instead supply data.rds: a character vector of full paths to randomly split subsets of the full data, each saved in .rds format.
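For illustration (the file paths and the in-memory data frame dat are hypothetical; in practice the subsets would be written by whatever process generates the data), one might pre-split the data and pass the paths via data.rds:

```r
## Randomly split a data frame 'dat' into 10 subsets and save each as .rds
K <- 10
idx <- sample(rep(seq_len(K), length.out = nrow(dat)))
paths <- file.path(tempdir(), paste0("chunk", seq_len(K), ".rds"))
for (k in seq_len(K)) saveRDS(dat[idx == k, ], paths[k])

## Fit without loading the full data into memory;
## K is overwritten to length(data.rds)
fit <- dcalasso(Surv(time, status) ~ X1 + X2, family = cox.ph(),
                data.rds = paths)
```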

weights

prior weights for each observation

subset

an expression indicating the subset of rows of data to use in model fitting

na.action

how to handle missing values (NAs)

offset

an offset term, whose coefficient is fixed at one

lambda

candidate values of the tuning parameter for the adaptive lasso penalty: penalty = lambda * sum_j |beta_j| / |beta_j,initial|^gamma

gamma

exponent of the adaptive weights in the penalty: penalty = lambda * sum_j |beta_j| / |beta_j,initial|^gamma

K

number of divisions (subsets) of the full dataset. It is overwritten to length(data.rds) if data.rds is given.

iter.os

number of rounds of one-step updates applied to the warm start

ncores

number of cores to use. If ncores > 1, the loop over subsets is parallelized using foreach.

Value

coefficients.pen

coefficient estimates from the adaptive lasso shrinkage step

coefficients.unpen

initial unregularized estimator

cov.unpen

variance-covariance matrix of the unpenalized model

cov.pen

variance-covariance matrix of the penalized model

BIC

sequence of BIC values, one for each lambda

n.pen

sample size used to penalize the degrees of freedom in the BIC

n

number of rows of the data used in the fit

idx.opt

index of the lambda value achieving the minimal BIC

BIC.opt

minimal BIC

family

family object of the model

lambda.opt

value of lambda that minimizes the BIC

df

degrees of freedom at each lambda

p

number of covariates

iter

number of one-step iterations performed

Terms

the terms object of the model
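A sketch of inspecting a fitted object, using element names taken from the Value section above (fit is assumed to come from a prior dcalasso call):

```r
## Penalized (adaptive lasso) and unpenalized coefficient estimates
fit$coefficients.pen
fit$coefficients.unpen

## BIC at the selected tuning parameter
fit$BIC.opt

## Approximate standard errors of the unpenalized estimator
sqrt(diag(fit$cov.unpen))
```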

References

Wang, Yan, Chuan Hong, Nathan Palmer, Qian Di, Joel Schwartz, Isaac Kohane, and Tianxi Cai. "A Fast Divide-and-Conquer Sparse Cox Regression." arXiv preprint arXiv:1804.00735 (2018).

Author

Yan Wang <yaw719@mail.harvard.edu>, Tianxi Cai <tcai@hsph.harvard.edu>, Chuan Hong <Chuan_Hong@hms.harvard.edu>