get_eval_embed.Rd
get_eval_embed
Generates embeddings from a co-occurrence data file and evaluates them.
Returns a summary list containing metadata, evaluation results and the embeddings themselves.
get_eval_embed(
CO_file,
freq_file,
dims = seq(100, 1000, 100),
out_dir = NULL,
freq_min = 1000,
threshold = 10,
normalize = TRUE,
use.dataframe = FALSE,
save.summary = TRUE
)
Co-occurrence data file in .csv, .parquet or .Rdata format.
If use.dataframe = TRUE, it should be an R data frame instead.
The data should be a table with 3 columns, index1, index2 and count:
index1: The index of code1.
index2: The column index of code2.
count: The co-occurrence count for the pair.
If the columns are not named index1, index2 and count, the first 3 columns are used in that order.
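As a minimal sketch of the layout described above, a co-occurrence table could be built in memory like this. The object name co_df and all values are made up for illustration.

```r
# Hypothetical co-occurrence table with the three expected columns.
co_df <- data.frame(
  index1 = c(1L, 1L, 2L),   # index of code1
  index2 = c(1L, 2L, 3L),   # column index of code2
  count  = c(120, 45, 30)   # co-occurrence count of the pair
)

# When the column names differ, the first 3 columns are used in this
# order, so the column order matters.
str(co_df)
```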
Frequency count file in .csv, .parquet or .Rdata format.
If use.dataframe = TRUE, it should be an R data frame instead.
The data should be a table with 4 columns, index, code, description and freq_count:
index: The index of the code.
code: The name of the code.
description: The description text of the code.
freq_count: The frequency count of the code.
If the columns are not named index, code, description and freq_count, the first 4 columns are used in that order.
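A frequency table with this layout might look like the sketch below. The object name freq_df, the code names and the counts are all hypothetical.

```r
# Hypothetical frequency table with the four expected columns.
freq_df <- data.frame(
  index       = 1:3,
  code        = c("CODE_A", "CODE_B", "CODE_C"),
  description = c("first example code",
                  "second example code",
                  "third example code"),
  freq_count  = c(5200, 800, 1500),
  stringsAsFactors = FALSE   # keep code and description as character
)

str(freq_df)
```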
A numeric vector of embedding dimensions. Defaults to seq(100, 1000, 100).
Output folder. If NULL, defaults to your_working_directory/output.
Frequency count cutoff for code filtering. Codes whose frequency count is less than freq_min are filtered out. Defaults to 1000.
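The effect of the cutoff can be pictured with the following sketch. It is not the function's internal code: freq_df is a made-up table reduced to the relevant columns, and the comparison simply follows the "less than freq_min" wording above.

```r
# Hypothetical frequency table (description column omitted for brevity).
freq_df <- data.frame(
  index      = 1:4,
  code       = c("CODE_A", "CODE_B", "CODE_C", "CODE_D"),
  freq_count = c(5200, 800, 1000, 1500),
  stringsAsFactors = FALSE
)

freq_min <- 1000

# Drop codes whose count is strictly less than freq_min;
# a count equal to freq_min is kept.
kept <- freq_df[freq_df$freq_count >= freq_min, ]
kept$code
```

With these values, CODE_B (800) is dropped and the other three codes remain.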
Integer; the threshold used to compute the SPPMI matrix. Defaults to 10.
TRUE or FALSE; whether to normalize the embedding. Defaults to TRUE.
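The kind of normalization applied is not specified here. One common choice, assumed in this sketch, is to scale each embedding row to unit Euclidean length; the matrix emb and its values are hypothetical.

```r
# Hypothetical 3 x 2 embedding matrix, one row per code.
emb <- matrix(c(3, 4,
                0, 5,
                1, 2),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("CODE_A", "CODE_B", "CODE_C"), NULL))

# Euclidean (L2) norm of each row.
row_norms <- sqrt(rowSums(emb^2))

# Dividing a matrix by a vector of length nrow(emb) recycles the vector
# down the columns, so every entry in row i is divided by row_norms[i].
emb_normalized <- emb / row_norms

# Each row now has squared length 1 (up to floating-point error).
rowSums(emb_normalized^2)
```

A row of all zeros would produce NaN under this scheme, so real code would need to handle that case.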
TRUE or FALSE. If TRUE, CO_file and freq_file are treated as R data frames rather than file names.
TRUE or FALSE. If FALSE, the function does not save results to out_dir.
A list containing metadata, the embedding and the evaluation results. Unless save.summary = FALSE, it is also saved in out_dir as an .Rdata file.
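A call that passes in-memory data frames might be wrapped as in the sketch below. The wrapper name run_embedding is hypothetical, and the shortened dims is only an illustrative choice. The argument names are taken from the usage shown earlier. Defining the wrapper does not call get_eval_embed, which must be available in the session when the wrapper is actually run.

```r
# Hypothetical convenience wrapper around get_eval_embed for data frames.
run_embedding <- function(co_df, freq_df, out_dir = NULL) {
  get_eval_embed(
    CO_file       = co_df,               # data frame: index1, index2, count
    freq_file     = freq_df,             # data frame: index, code, description, freq_count
    dims          = seq(100, 300, 100),  # fewer dimensions than the default
    out_dir       = out_dir,
    freq_min      = 1000,
    threshold     = 10,
    normalize     = TRUE,
    use.dataframe = TRUE,                # inputs are data frames, not file names
    save.summary  = FALSE                # return the list without writing to out_dir
  )
}

# Usage, assuming co_df and freq_df hold real data:
# summary_list <- run_embedding(co_df, freq_df)
```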