Main function implementing the MAP algorithm, which calculates the predicted probability of a positive phenotype for each patient based on NLP and ICD counts adjusted for healthcare utilization. For a large number of patients (more than 50k) the computation may take a very long time, so the subset_sample parameter allows the fit to be performed on a subset of patients, with the remaining patients projected onto the fitted model. The subset_sample_size parameter controls the maximum number of patients used for the fit.
Usage
MAP(
mat = NULL,
note = NULL,
yes.con = FALSE,
full.output = FALSE,
subset_sample = FALSE,
subset_sample_size = 5000,
verbose = TRUE
)
Arguments
- mat
Count data (sparse matrix). One of the columns must contain the ICD counts and must be named ICD.
- note
Note count (sparse matrix) indicating healthcare utilization.
- yes.con
A logical variable indicating whether concomitant is desired. Currently not used.
- full.output
A logical variable indicating whether full outputs are desired.
- subset_sample
Logical; if TRUE, perform the fit on a subset of patients and project the remaining patients.
- subset_sample_size
If subset_sample is TRUE, the maximum number of patients on which to perform the fit (default 5000, as shown in Usage).
- verbose
Logical; if TRUE, print model information.
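The input requirements described above can be checked before calling MAP. The following is a minimal sketch, not part of the package: the toy counts are made up for illustration, and the check that mat and note have one row per patient in the same order is an assumption based on the Examples, not a documented rule.

```r
library(Matrix)  # provides Matrix() for building sparse matrices

# Hypothetical toy counts for 8 patients (illustrative values only)
ICD <- c(12, 9, 1, 0, 2, 0, 0, 0)
NLP <- c(10, 11, 0, 1, 1, 0, 0, 0)
notes <- c(15, 18, 14, 16, 20, 13, 17, 15)

# Count matrix: cbind() keeps the variable names as column names
mat <- Matrix(cbind(ICD, NLP), sparse = TRUE)

# Note counts (healthcare utilization) as a one-column sparse matrix
note <- Matrix(notes, ncol = 1, sparse = TRUE)

# Documented requirement: one column of mat must be named ICD
has_icd <- "ICD" %in% colnames(mat)

# Assumption: mat and note have one row per patient, in the same order
same_rows <- nrow(mat) == nrow(note)

stopifnot(has_icd, same_rows)
```

If either check fails, it is probably easier to fix the inputs first than to interpret an error raised inside the fit.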
Value
Returns a list with the following objects:
- scores
The predicted probabilities of a positive phenotype.
- cut.MAP
The cutoff value that can be used to derive binary phenotypes.
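As a minimal sketch of how the two returned objects might be combined, the snippet below derives binary phenotypes by thresholding the scores at the cutoff. The numbers are copied from the Examples output on this page. The names cut_map and phenotype are introduced here for illustration, and treating a score at or above the cutoff as positive is an assumption, since the comparison direction at the boundary is not stated in this documentation.

```r
library(Matrix)  # scores are returned as a one-column sparse matrix

# Values copied from the Examples output (first six patients)
scores <- Matrix(
  c(0.9384403, 0.9662253, 0.9998453, 0.9999777, 0.9999809, 0.6284407),
  ncol = 1, sparse = TRUE
)
cut_map <- 0.3395483  # stands in for res$cut.MAP

# Assumption: a patient is called positive when score >= cutoff
phenotype <- as.integer(as.vector(scores) >= cut_map)
```

With real output, scores and cut_map would be replaced by res$scores and res$cut.MAP.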
References
High-throughput Multimodal Automated Phenotyping (MAP) with Application to PheWAS. Katherine P. Liao, Jiehuan Sun, Tianrun A. Cai, Nicholas Link, Chuan Hong, Jie Huang, Jennifer Huffman, Jessica Gronsbell, Yichi Zhang, Yuk-Lam Ho, Victor Castro, Vivian Gainer, Shawn Murphy, Christopher J. O’Donnell, J. Michael Gaziano, Kelly Cho, Peter Szolovits, Isaac Kohane, Sheng Yu, and Tianxi Cai with the VA Million Veteran Program (2019) <doi:10.1101/587436>.
Examples
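## simulate data to test the algorithm
library(Matrix)  # provides Matrix() for building sparse matrices
n = 400
## ICD and NLP counts: high for the first quarter of patients,
## low for the second quarter, zero for the remaining half
ICD = c(rpois(n/4,10), rpois(n/4,1), rep(0,n/2) )
NLP = c(rpois(n/4,10), rpois(n/4,1), rep(0,n/2) )
## count matrix; cbind() names the columns ICD and NLP
mat = Matrix(data=cbind(ICD,NLP),sparse = TRUE)
## note counts (healthcare utilization) as a one-column sparse matrix
note = Matrix(rpois(n,10)+5,ncol=1,sparse = TRUE)
res = MAP(mat = mat, note=note)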
#> #######################
#> MAP only considers patients who have note count data and
#> at least one non-missing variable
#> ####
#> Here is a summary of the input data:
#> Total number of patients: 400
#> ICD NLP note Freq
#> 1 YES YES YES 400
#> ####
head(res$scores)
#> 6 x 1 sparse Matrix of class "dgCMatrix"
#>
#> [1,] 0.9384403
#> [2,] 0.9662253
#> [3,] 0.9998453
#> [4,] 0.9999777
#> [5,] 0.9999809
#> [6,] 0.6284407
res$cut.MAP
#> [1] 0.3395483