Specify the data to be used for phenotyping.

PhecapData(
  data, hu_feature, label, validation,
  patient_id = NULL, subject_weight = NULL,
  seed = 12300L, feature_transformation = log1p)

Arguments

data

A data.frame containing all the variables needed for phenotyping, a character scalar giving the path to the data, or a list whose elements are each a character scalar or a data.frame. If a list is given, patient_id cannot be NULL: all the datasets in the list are joined into a single dataset on the columns specified by patient_id.
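
The joining of a list input can be pictured with base R. This is a minimal sketch, not the package's implementation: it assumes an inner join via merge(), and the tables codes and notes, together with their column names, are hypothetical.

```r
# Hypothetical per-source tables that share the key column "patient_id".
codes <- data.frame(patient_id = c(1, 2, 3), main_icd = c(4, 0, 2))
notes <- data.frame(patient_id = c(1, 2, 3), main_nlp = c(7, 1, 0))

# Join every table in the list on the key column(s), one pair at a time.
# Assumption: inner join; the join type used by PhecapData is not stated here.
patient_id <- "patient_id"
joined <- Reduce(
  function(x, y) merge(x, y, by = patient_id),
  list(codes, notes)
)
```

Because the key columns drive the join, every dataset in the list needs to contain them, which is why patient_id cannot be NULL for list input.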

hu_feature

A character scalar or vector specifying the names of one or more healthcare utilization (HU) variables. These variables are always included in the phenotyping model.

label

A character scalar giving the name of the column that holds the phenotype status (1 or TRUE: present, 0 or FALSE: absent). If labels are not available yet, add a column filled with NA to data; in that case only the feature extraction step can be performed.
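
Adding a placeholder label column takes one line in base R. The data frame and column names below are hypothetical and only illustrate the shape of the input.

```r
# Hypothetical feature table without any phenotype labels yet.
features_only <- data.frame(patient_id = 1:3, main_icd = c(4, 0, 2))

# Placeholder label column filled with NA, as described above.
features_only$label <- NA
```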

validation

A character scalar, a real number strictly between 0 and 1, or an integer not less than 2. A character scalar is treated as the name of the column in the data that indicates whether each observation belongs to the validation samples (1 or TRUE: validation, 0 or FALSE: training). A real number strictly between 0 and 1 is treated as the proportion of validation samples, and an integer not less than 2 as the number of validation samples; in both cases the actual validation samples are drawn from all labeled samples.
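
The three accepted forms can be made concrete with a small base R sketch. It is not the package's code: the helper resolve_validation, the rounding rule for proportions, and the toy data are all assumptions made for illustration.

```r
# Hypothetical helper: turn a `validation` specification into a logical
# indicator (TRUE = validation, FALSE = training, NA = unlabeled).
resolve_validation <- function(data, label, validation, seed = 12300L) {
  if (is.character(validation)) {
    # Column name: read the indicator directly from the data.
    return(as.logical(data[[validation]]))
  }
  labeled <- which(!is.na(data[[label]]))
  # A value in (0, 1) is a proportion; a value >= 2 is an absolute size.
  n_validation <- if (validation < 1) {
    round(length(labeled) * validation)  # assumption: rounding rule
  } else {
    as.integer(validation)
  }
  set.seed(seed)  # note: this resets the global random number stream
  # Draw positions within `labeled`, so only labeled rows can be chosen.
  chosen <- labeled[sample.int(length(labeled), n_validation)]
  indicator <- rep(NA, nrow(data))
  indicator[labeled] <- FALSE
  indicator[chosen] <- TRUE
  indicator
}

# Toy data: 8 labeled rows and 2 unlabeled rows.
toy <- data.frame(label = c(1, 0, NA, 1, 0, NA, 1, 0, 1, 0))
by_proportion <- resolve_validation(toy, "label", 0.5)  # half of the labeled rows
by_size <- resolve_validation(toy, "label", 3L)         # three labeled rows
```

In this sketch, unlabeled rows are never assigned to either set, which mirrors the statement that validation samples are drawn from the labeled samples only.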

patient_id

A character vector of the names of the columns, if any, that uniquely identify each patient. These columns must appear in the data. patient_id can be NULL if the data contain no such columns.

subject_weight

An optional numeric vector of weights for observations.

seed

If validation samples need to be drawn from all labeled samples, seed specifies the random seed for sampling.

feature_transformation

A function that will be applied to all the features. Since count data are typically right-skewed, log1p is used by default. feature_transformation can be NULL, in which case no transformation is applied to any of the features.
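
As a brief illustration of the default, log1p(x) computes log(1 + x), so zero counts stay at zero while large counts are compressed. The count values and the function log1p_manual below are made up for the example.

```r
# Hypothetical right-skewed count feature.
counts <- c(0, 1, 9, 99)

# Default transformation: log(1 + x), well defined at zero.
transformed <- log1p(counts)

# A user-supplied function with the same effect, written out explicitly.
log1p_manual <- function(x) log(1 + x)
same_result <- isTRUE(all.equal(transformed, log1p_manual(counts)))
```

Any function that maps a numeric vector to a numeric vector of the same length fits this pattern.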

Value

An object of class PhecapData.

See also

See PheCAP-package for code examples.