get_embed_regression runs an embedding regression to select codes related to the PheCodes of interest. It returns a list of summary objects.

get_embed_regression(
  embed_train,
  embed_valid,
  phecodes,
  dim,
  lambda_vec = c(seq(1, 51, 1) * 1e-06, seq(60, 1000, 50) * 1e-06),
  alpha = 0.25,
  custom_dict = NULL
)

Arguments

embed_train

Training embedding for regression.

embed_valid

Validation embedding for regression.

phecodes

PheCodes of interest.

dim

Embedding dimension, used in the AIC/BIC calculation.

lambda_vec

Candidate lambda values for glmnet. Suitable values are highly data-specific. Default: c(seq(1, 51, 1) * 1e-06, seq(60, 1000, 50) * 1e-06)
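To make the shape of the default grid concrete, the following base R sketch rebuilds it from the usage signature and inspects it. The log-spaced alternative at the end is a hypothetical illustration of supplying a custom grid, not a recommendation from the package.

```r
# Rebuild the default lambda grid shown in the usage signature.
lambda_fine   <- seq(1, 51, 1) * 1e-06      # 51 finely spaced values
lambda_coarse <- seq(60, 1000, 50) * 1e-06  # 19 coarsely spaced values
lambda_vec    <- c(lambda_fine, lambda_coarse)

length(lambda_vec)  # total number of candidate lambdas
range(lambda_vec)   # smallest and largest candidate

# Hypothetical alternative (an assumption, not a package default):
# a log-spaced grid covering roughly the same range.
lambda_log <- exp(seq(log(1e-06), log(1e-03), length.out = 50))
```

Because the grid is data-specific, checking where the selected lambda falls relative to the ends of the grid is a reasonable sanity check before trusting the result.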

alpha

Alpha value for glmnet. Default: 0.25.
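For context, glmnet's elastic-net penalty mixes a ridge term and a lasso term through alpha. The expression below is the standard glmnet penalty. That this function passes alpha and lambda to glmnet unchanged is an assumption.

```latex
% Elastic-net penalty as parameterized by glmnet
P_{\alpha,\lambda}(\beta)
  = \lambda \left[ \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2
                   + \alpha\,\lVert \beta \rVert_1 \right]

% With the default \alpha = 0.25:
P_{0.25,\lambda}(\beta)
  = \lambda \left[ 0.375\,\lVert \beta \rVert_2^2
                   + 0.25\,\lVert \beta \rVert_1 \right]
```

At alpha = 0.25 the penalty leans toward ridge, while the lasso term can still shrink some coefficients exactly to zero.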

custom_dict

Dictionary file for code mapping. If not provided, the internal dictionary is used. Data structure:

  • code: codes like PheCode:714.1

  • desc: descriptions like rheumatoid arthritis
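A minimal sketch of a dictionary with the two fields described above. It assumes custom_dict can be supplied as a data frame with columns named code and desc. The reference says "dictionary file", so the exact accepted type is an assumption here.

```r
# Assumed representation: a data frame with 'code' and 'desc' columns.
# The single row reuses the example values given in the argument description.
custom_dict <- data.frame(
  code = "PheCode:714.1",
  desc = "rheumatoid arthritis",
  stringsAsFactors = FALSE
)

# Basic structural checks before passing the dictionary on.
stopifnot(all(c("code", "desc") %in% names(custom_dict)))
stopifnot(!anyDuplicated(custom_dict$code))
```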

Value

A list of information including:

  • summary_data: Regression summary of the selected codes, with betas, cosine values, and code descriptions.

  • Nlist: Number of non-zero betas at each lambda value.

  • min_lambdas: The best lambda for each PheCode of interest, chosen by minimum AIC + testing residual.

  • eval_plots: Plots of residuals against log(lambda) for the PheCodes of interest.

  • wordcloud_plots: Word cloud plots of the selected features, scaled by cosine value.

  • selected_features: Selected features, that is, the features in summary_data whose beta is not equal to 0.
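The relationship between summary_data and selected_features can be sketched with a mock table. The column names (code, beta, cosine) and all values below are illustrative assumptions. The actual structure of the returned objects may differ.

```r
# Mock stand-in for summary_data; names and numbers are made up for illustration.
summary_data <- data.frame(
  code   = c("feature_A", "feature_B", "feature_C"),
  beta   = c(0.12, 0, -0.03),
  cosine = c(0.61, 0.40, 0.35),
  stringsAsFactors = FALSE
)

# Keep only the features with a non-zero coefficient,
# mirroring the description of selected_features.
selected_features <- summary_data[summary_data$beta != 0, ]

selected_features$code
```

In this mock, feature_B is dropped because its beta is exactly zero, while the other two rows are kept regardless of the sign of beta.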