set.seed(1)
= 3000 # The total number of entities (rows/columns of the matrix)
N = 10 # The rank of the underlying low-rank matrix
r = 5 # The number of data sources (matrices)
m = 0.1 # The probability of observing an entry in each source p
Multi-source Learning via Completion of Block-wise Overlapping Noisy Matrices
Exploit the orthogonal Procrustes problem to align the eigenspace of the two sub-matrices, then complete the missing blocks by the inner product of the two low-rank components. Background and details about the methods can be found at Zhou et al.(2021)https://doi.org/10.48550/arXiv.2105.10360
Detail
The BONMI_package.R script is designed for efficient matrix transformation and dimensionality reduction, particularly tailored for analyzing Positive Pointwise Mutual Information (PPMI) matrices. It features the Procrustes function, which aligns two matrices by computing an orthogonal transformation matrix that minimizes their Frobenius norm difference. The embedding function extracts a low-dimensional representation from singular value decomposition (SVD) results to simplify complex data. The main function, BONMI, processes a list of PPMI matrices, optionally applies weights, and leverages standard or randomized SVD to create reduced-rank embeddings. It aligns these embeddings with the Procrustes function, ensuring consistent representation across datasets.
Reference
Zhou, D., Cai, T., & Lu, J. (2021). Multi-source learning via completion of block-wise overlapping noisy matrices. arXiv preprint arXiv:2105.10360.
Finlayson, S. G., LePendu, P., and Shah, N. H. (2014). Building the graph of medicine from millions of clinical narratives. Scientific data, 1:140032.
Johnson, A. E., Pollard, T. J., Shen, L., Li-Wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. (2016). Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9.
Example
1. Setting Parameters and Generating Data
- The
set.seed(1)
ensures reproducibility by initializing the random number generator.
2. Creating the Low-Rank Matrix W0
= matrix(rnorm(N*r), nrow=N)
X0 = X0 %*% t(X0)
W0 rownames(W0) = colnames(W0) = paste0('code', 1:N)
X0
is anN x r
matrix filled with random values from a standard normal distribution.W0
is anN x N
low-rank symmetric matrix obtained by multiplyingX0
by its transpose.rownames
andcolnames
are set tocode1
tocode3000
to identify the rows and columns.
3. Creating Multiple Noisy Submatrices
= list()
W for(s in 1:m){
= which(runif(N) < p) # Randomly selecting observed indices
ids = length(ids) # Number of observed entries
Ns = matrix(rnorm(Ns*Ns, sd=s*0.01), nrow=Ns) # Noise matrix with increasing variance
Es = Es + t(Es) # Ensuring the noise matrix is symmetric
Es = W0[ids, ids] + Es # Adding noise to the observed submatrix
Ws = Ws # Storing the noisy submatrix in a list
W[[s]] }
- This loop creates
m
submatrices with block-wise observations. - Each submatrix
Ws
is sampled fromW0
and has added symmetric noise. - The result is stored in the list
W
, which contains all submatrices.
4. Using BONMI
for Matrix Completion
<- BONMI(W, r, weights=NULL, rsvd.use=FALSE)
Xhat #Xhat <- BONMI(W, r, weights=NULL, rsvd.use=TRUE)
##weights: the weight vector for the PPMI matrices. If weights = NULL, then it will be estimated from data.
##rsvd.use: If rsvd.use=TRUE, we will use the 'rsvd' function to calculate the svd, which is much faster than the 'svd' function when r is small.
Xhat
is the output of theBONMI
function, representing the estimated low-rank embeddings.- The
weights
parameter is set toNULL
, andrsvd.use=FALSE
means standard SVD is used. - Optionally,
rsvd.use=TRUE
can be set to use randomized SVD.
5. Post-Processing and Result Verification
= rownames(Xhat)
codes = Xhat %*% t(Xhat)
What
= match(codes, rownames(W0))
id = W0[id, id]
Wstar
#BONMI's result
norm(What - Wstar, 'F') / norm(Wstar, 'F')
[1] 0.003746704
What
is the reconstructed matrix using the output embeddingsXhat
.id
finds the matching indices betweenXhat
and the original matrixW0
.Wstar
is the corresponding submatrix ofW0
for comparison.- The result
norm(What - Wstar, 'F') / norm(Wstar, 'F')
calculates the relative Frobenius norm error, measuring how closeWhat
is to the true submatrixWstar
.
<-function(embed){
Myeval<- data.frame(code = rownames(embed))
dict $group <- sapply(dict$code, function(x) strsplit(x,":")[[1]][1])
dict<- get_AUC_embed(embed = embed, pairs = pairs, WithNull = FALSE, dict = dict,
AUC normalize = FALSE, interested_code = NULL, nmax = 10000, myseed = 214)
= AUC$`codi-codi(similar)`
sim = AUC$`codi-codi(related)`
rel = sim[sim[,2]>50,]
sim = rel[rel[,2]>50,]
rel
$weighted_sim = c('auc'=sum(sim[,1]*sim[,2])/sum(sim[,2]),'num'=sum(sim[,2]),'dim'=nrow(embed))
AUC$weighted_rel = c('auc'=sum(rel[,1]*rel[,2])/sum(rel[,2]),'num'=sum(rel[,2]),'dim'=nrow(embed))
AUC
AUC }
= Embed(fit.W,r)
X rownames(X) = rownames(Wm)
= X[rownames(X)%in%rownames(W[[1]]),]
X1 = X[rownames(X)%in%rownames(W[[2]]),]
X2 = X[rownames(X)%in%rownames(W[[3]]),]
X3 = X[rownames(X)%in%rownames(W[[4]]),]
X4 = X[rownames(X)%in%rownames(W[[5]]),]
X5
load(file='/n/data1/hsph/biostat/celehs/lab/SHARE/Relationship Pairs/PairsAlltypeUMLS_05122024.Rdata')
source("/n/data1/hsph/biostat/celehs/lab/SHARE/Relationship Pairs/AUC Code/get_AUC_MS.R")
= Myeval(X)
auc.BONMI = Myeval(X1)
auc1 = Myeval(X2)
auc2 = Myeval(X3)
auc3 = Myeval(X4)
auc4 = Myeval(X5)
auc5 = rbind(auc1$weighted_sim,auc2$weighted_sim,auc3$weighted_sim,auc4$weighted_sim,auc5$weighted_sim,auc.BONMI$weighted_sim)
res.sim round(res.sim,3)
# auc num dim
# [1,] 0.920 16567 9620
# [2,] 0.920 16000 6964
# [3,] 0.895 18007 18432
# [4,] 0.905 12261 5203
# [5,] 0.879 2469 4341
# [6,] 0.882 20276 31233
# auc num dim
# [1,] 0.943 16567 9620
# [2,] 0.946 16000 6964
# [3,] 0.926 18007 18432
# [4,] 0.934 12261 5203
# [5,] 0.922 2469 4341
# [6,] 0.915 20276 31233
# auc num dim
# [1,] 0.944 16567 9620
# [2,] 0.947 16000 6964
# [3,] 0.924 18007 18432
# [4,] 0.934 12261 5203
# [5,] 0.926 2469 4341
# [6,] 0.912 20276 31233
= rbind(auc1$weighted_rel,auc2$weighted_rel,auc3$weighted_rel,auc4$weighted_rel,auc5$weighted_rel,auc.BONMI$weighted_rel)
res.rel round(res.rel,3)
# auc num dim
# [1,] 0.849 20154 9620
# [2,] 0.862 18334 6964
# [3,] 0.842 22040 18432
# [4,] 0.837 16219 5203
# [5,] 0.806 4628 4341
# [6,] 0.790 31291 31233
# auc num dim
# [1,] 0.867 20154 9620
# [2,] 0.882 18334 6964
# [3,] 0.862 22040 18432
# [4,] 0.860 16219 5203
# [5,] 0.842 4628 4341
# [6,] 0.813 31291 31233
# auc num dim
# [1,] 0.867 20154 9620
# [2,] 0.881 18334 6964
# [3,] 0.860 22040 18432
# [4,] 0.858 16219 5203
# [5,] 0.845 4628 4341
# [6,] 0.812 31291 31233