Train the phenotyping model on the training dataset, and evaluate its performance via random splits of the training dataset.

phecap_train_phenotyping_model(
  data, surrogates, feature_selected,
  method = "lasso_bic",
  train_percent = 0.7, num_splits = 200L,
  start_seed = 78900L, verbose = 0L)

Arguments

data

an object of class PhecapData, obtained by calling PhecapData(...).

surrogates

a list of objects of class PhecapSurrogate, obtained by something like list(PhecapSurrogate(...), PhecapSurrogate(...)). The surrogates used here may differ from those used in feature extraction.

feature_selected

a character vector of the features that should be included in the model, typically the output of phecap_run_feature_extraction, although this is not required. The features listed here may differ from those returned by feature extraction.

method

Either a character vector or a list of two components. If a character vector is used, possible entries are given below. When at least two methods are specified, the predicted probability is the simple average of the predicted probabilities from each method.

  • 'plain' (logistic regression without penalty)

  • 'ridge_cv' (logistic regression with ridge penalty and CV tuning)

  • 'lasso_cv' (logistic regression with lasso penalty and CV tuning)

  • 'lasso_bic' (logistic regression with lasso penalty and BIC tuning)

  • 'alasso_cv' (logistic regression with adaptive lasso penalty and CV tuning)

  • 'alasso_bic' (logistic regression with adaptive lasso penalty and BIC tuning)

  • 'svm' (support vector machine with CV tuning, package e1071 needed, subject_weight not supported)

  • 'rf' (random forest with default parameters, package randomForestSRC needed)

  • 'xgb' (extreme gradient boosting with default parameters, package xgboost needed)
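As a rough illustration of the averaging rule for multiple methods, the base R sketch below combines predicted probabilities from two methods. The column labels reuse method names from the list above, but the numbers are made up for illustration, and this is not the package's internal code.

```r
# Toy predicted probabilities for three subjects from two methods.
# The values are invented purely to illustrate the averaging rule.
prob_by_method <- cbind(
  lasso_bic = c(0.9, 0.2, 0.6),
  ridge_cv  = c(0.7, 0.4, 0.5)
)

# Simple (unweighted) average across methods, one value per subject.
ensemble_prob <- rowMeans(prob_by_method)
print(ensemble_prob)
```

Each method contributes equally to the final probability, so a poorly calibrated method can pull the average away from the better ones.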

If a list is used, it should contain two named components as follows.

  • fit (a function for model fitting, with arguments x, y, subject_weight, penalty_weight)

  • predict (a function for prediction, with arguments object, the value returned by fit, and x, the new data to predict on)
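A minimal sketch of such a list, using only base R, might look like the following. It makes several assumptions not stated above: that x is a numeric matrix without an intercept column, that fit may return any object that is later handed to predict as object, and that predict should return a vector of probabilities. The names custom_fit, custom_predict, and custom_method are hypothetical.

```r
# Hypothetical custom method: unpenalized, weighted logistic regression.
custom_fit <- function(x, y, subject_weight = NULL, penalty_weight = NULL, ...) {
  x <- as.matrix(x)
  if (is.null(subject_weight)) {
    subject_weight <- rep(1, length(y))
  }
  # penalty_weight is accepted to match the documented signature,
  # but this unpenalized sketch does not use it.
  # quasibinomial gives the same coefficients as binomial and avoids
  # warnings when subject_weight is not integer-valued.
  fit <- stats::glm.fit(
    x = cbind(1, x), y = y, weights = subject_weight,
    family = stats::quasibinomial()
  )
  fit$coefficients  # intercept first, then one coefficient per column of x
}

custom_predict <- function(object, x, ...) {
  x <- as.matrix(x)
  beta <- object
  beta[is.na(beta)] <- 0  # coefficients dropped for collinearity count as zero
  drop(stats::plogis(cbind(1, x) %*% beta))
}

custom_method <- list(fit = custom_fit, predict = custom_predict)
```

Under these assumptions, the list could then be supplied as method = custom_method.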

train_percent

The proportion (between 0 and 1) of labels used for model training in each random split.

num_splits

The number of random splits.

start_seed

in the i-th split, the seed is set to start_seed + i.

verbose

if positive, progress is printed once every verbose splits; if zero, nothing is printed.
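To make the interplay of train_percent, num_splits, and start_seed concrete, the base R sketch below generates reproducible random splits. Only the seed rule (start_seed + i in the i-th split) and the parameter meanings come from the descriptions above; the helper make_splits, the rounding of the training size, and the use of sample.int are assumptions and may not match the package's internals.

```r
# Hypothetical helper: index sets for repeated random train/test splits.
make_splits <- function(n_labeled, train_percent = 0.7,
                        num_splits = 200L, start_seed = 78900L) {
  n_train <- as.integer(round(n_labeled * train_percent))
  lapply(seq_len(num_splits), function(i) {
    set.seed(start_seed + i)  # seed rule as documented for start_seed
    train <- sort(sample.int(n_labeled, n_train))
    list(train = train, test = setdiff(seq_len(n_labeled), train))
  })
}

# Five splits of 100 labeled subjects, 70 for training in each split.
splits <- make_splits(100L, num_splits = 5L)
```

Because the seed depends only on start_seed and the split index, rerunning with the same arguments reproduces the same splits.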

Value

An object of class PhecapModel, with components

coefficients

the fitted object

method

the method used for model training

feature_selected

the features selected by SAFE

train_roc

ROC on the training dataset

train_auc

AUC on the training dataset

split_roc

average ROC over random splits of the training dataset

split_auc

average AUC over random splits of the training dataset

fit_function

the function used for fitting

predict_function

the function used for prediction
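For readers unfamiliar with the AUC components above, the sketch below shows one standard way to compute an AUC from labels and predicted probabilities (the rank-based Mann-Whitney formula) and how per-split values could be averaged. The function auc_rank and all numbers are hypothetical; the package may compute train_auc and split_auc differently.

```r
# Rank-based AUC: the probability that a randomly chosen case receives a
# higher score than a randomly chosen control (ties count as one half).
auc_rank <- function(y, prob) {
  r <- rank(prob)  # average ranks handle tied scores
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy labels and predicted probabilities.
y <- c(0, 0, 1, 1)
prob <- c(0.1, 0.4, 0.35, 0.8)
toy_auc <- auc_rank(y, prob)

# Averaging made-up per-split AUCs, in the spirit of split_auc.
toy_split_aucs <- c(0.80, 0.70, 0.75)
toy_split_auc <- mean(toy_split_aucs)
```

An AUC computed on the same data used for fitting tends to be optimistic, which is why the split-based average is the more useful estimate of out-of-sample performance.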

See also

See PheCAP-package for code examples.