Generate & Evaluate Embedding — get_eval_embed

get_eval_embed generates embeddings from a co-occurrence data file and evaluates them. It returns a summary list containing the meta-data, the evaluation results and the embedding itself.

get_eval_embed(
  CO_file,
  freq_file,
  dims = seq(100, 1000, 100),
  out_dir = NULL,
  freq_min = 1000,
  threshold = 10,
  normalize = TRUE,
  use.dataframe = FALSE,
  save.summary = TRUE
)
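A call could be assembled roughly as follows. This is a minimal sketch: the two file names are hypothetical placeholders, and the call is guarded so that it only runs when get_eval_embed is available in the session and both files exist.

```r
# Hypothetical input paths; replace with your own files.
co_path   <- "cooccurrence.csv"
freq_path <- "frequency.csv"

# Arguments named exactly as in the usage above.
args <- list(
  CO_file       = co_path,
  freq_file     = freq_path,
  dims          = seq(100, 300, 100),  # try 3 dimensions instead of the default 10
  out_dir       = NULL,
  freq_min      = 1000,
  threshold     = 10,
  normalize     = TRUE,
  use.dataframe = FALSE,
  save.summary  = FALSE                # keep the result in memory only
)

# Only call the function if it is loaded and the inputs exist.
if (exists("get_eval_embed", mode = "function") &&
    file.exists(co_path) && file.exists(freq_path)) {
  summary_list <- do.call(get_eval_embed, args)
  str(summary_list, max.level = 1)
}
```

Collecting the arguments in a list first makes it easy to rerun the same configuration with a different dims vector.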

Arguments

CO_file

Co-occurrence data file in .csv, .parquet or .Rdata format. If use.dataframe = TRUE, it should be an R data frame instead. The data should be a table with 3 columns, index1, index2 and count:

index1: The index of the first code in the pair (code1).
index2: The index of the second code in the pair (code2).
count: The co-occurrence count for that pair.

If the columns are not named index1, index2 and count, the function treats the first 3 columns as the corresponding columns.
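As an illustration of the expected layout, the sketch below builds a toy co-occurrence table and mimics the documented fallback for differently named columns. All values and the alternative column names (code_a, code_b, n) are made up, and the renaming step is only an approximation of what the function does internally.

```r
# Toy co-occurrence table with the documented column names (made-up values).
co <- data.frame(
  index1 = c(1L, 1L, 2L),
  index2 = c(1L, 2L, 3L),
  count  = c(120, 35, 8)
)

# A table with other column names (hypothetical names).
co_other <- data.frame(
  code_a = c(1L, 2L),
  code_b = c(2L, 3L),
  n      = c(35, 8)
)

# Mimic the fallback: treat the first 3 columns as index1, index2, count.
expected <- c("index1", "index2", "count")
if (!all(expected %in% names(co_other))) {
  co_other <- co_other[, 1:3]
  names(co_other) <- expected
}

print(co)
print(co_other)
```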

freq_file

Frequency count file in .csv, .parquet or .Rdata format. If use.dataframe = TRUE, it should be an R data frame instead. The data should be a table with 4 columns, index, code, description and freq_count:

index: The index of the code.
code: The name of the code.
description: The description text of the code.
freq_count: The frequency count of the code.

If the columns are not named index, code, description and freq_count, the function treats the first 4 columns as the corresponding columns.
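A frequency table in this layout might look like the following sketch. The codes, descriptions and counts are invented for illustration only.

```r
# Toy frequency table with the four documented columns (made-up values).
freq <- data.frame(
  index       = 1:3,
  code        = c("A01", "B02", "C03"),
  description = c("first code", "second code", "third code"),
  freq_count  = c(5200, 1500, 300),
  stringsAsFactors = FALSE
)

# Inspect column names and types before passing the table on.
str(freq)
```

With use.dataframe = TRUE, a data frame like this would be passed as freq_file directly rather than written to disk first.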

dims

A numeric vector of embedding dimensions. Default is seq(100, 1000, 100).

out_dir

Output folder. If NULL, it defaults to your_working_directory/output.s

freq_min

The frequency count cutoff for code filtering. Codes whose frequency count is less than freq_min are filtered out. Default is 1000.
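The effect of the cutoff can be sketched in base R as below. The data are made up, and dropping co-occurrence pairs that involve a removed code is an assumption about how the filter propagates, not something stated here.

```r
# Toy frequency table (made-up values).
freq <- data.frame(
  index      = 1:3,
  code       = c("A01", "B02", "C03"),
  freq_count = c(5200, 1000, 300),
  stringsAsFactors = FALSE
)
freq_min <- 1000

# Codes with freq_count below freq_min are filtered out; a count equal to the cutoff is kept.
kept <- freq[freq$freq_count >= freq_min, ]

# Assumption: pairs are kept only when both codes survive the filter.
co <- data.frame(
  index1 = c(1L, 1L, 2L),
  index2 = c(2L, 3L, 3L),
  count  = c(35, 12, 8)
)
co_kept <- co[co$index1 %in% kept$index & co$index2 %in% kept$index, ]

print(kept$code)
print(co_kept)
```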

threshold

Integer threshold used to compute the SPPMI matrix. Default is 10.

normalize

TRUE or FALSE. Whether to normalize the embedding. Default is TRUE.
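The exact normalization is not spelled out here. One common choice, assumed in the sketch below, is to scale each embedding row to unit L2 norm, so that the dot product of two rows equals their cosine similarity. The matrix values and row names are made up.

```r
# Toy embedding: 3 codes in 2 dimensions (made-up values).
emb <- matrix(
  c(3, 4,
    0, 5,
    1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A01", "B02", "C03"), NULL)
)

# Assumption: normalization means dividing each row by its L2 norm.
row_norms <- sqrt(rowSums(emb^2))
emb_unit  <- emb / row_norms  # the length-3 vector recycles down each column, i.e. per row

# After this step, pairwise dot products are cosine similarities.
cos_sim <- tcrossprod(emb_unit)
print(round(cos_sim, 3))
```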

use.dataframe

TRUE or FALSE. If TRUE, CO_file and freq_file are accepted as R data frames rather than file names. Default is FALSE.

save.summary

TRUE or FALSE. If FALSE, the function does not save the results to out_dir. Default is TRUE.

Value

A list containing the meta-data, the embedding and the evaluation results. If save.summary = TRUE, it is also saved in out_dir as an .Rdata file.
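The name of the saved file and of the object inside it are not given here, so the sketch below uses a stand-in: it writes a placeholder list to a temporary .Rdata file (hypothetical file, object and element names) and then loads it into a separate environment, which is a safe way to inspect an .Rdata file without knowing the object name in advance.

```r
# Stand-in for a saved summary; file, object and element names are hypothetical.
rdata_path <- file.path(tempdir(), "summary_example.Rdata")
summary_example <- list(meta = "placeholder")
save(summary_example, file = rdata_path)

# Load into a fresh environment so nothing in the workspace is overwritten.
env <- new.env()
loaded_names <- load(rdata_path, envir = env)  # load() returns the names of the restored objects

# Retrieve the first restored object and look at its top-level structure.
result <- get(loaded_names[1], envir = env)
str(result, max.level = 1)
```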