Calculate pairwise string similarity — simi_ptn

The function calculates pairwise similarity, based on Levenshtein edit distance, for the columns "title_norm" and "abstract_norm" between records within the same group after partitioning. Similarity ranges over [0, 1]: a value of 1 means the two strings are identical, and a value of 0 means they are completely different.
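To make the [0, 1] scale concrete, the sketch below derives a similarity from the Levenshtein distance using base R's adist(). Dividing by the length of the longer string is one common normalization and is an assumption here; the exact formula used by the package may differ. lev_similarity is a hypothetical helper name, not part of the package.

```r
# Hypothetical helper: Levenshtein-based similarity in [0, 1].
# Assumption: the distance is normalized by the length of the longer string.
lev_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]          # Levenshtein edit distance
  max_len <- max(nchar(a), nchar(b))
  if (max_len == 0) return(1)            # two empty strings are identical
  1 - d / max_len
}

lev_similarity("deep learning", "deep learning")  # identical strings
lev_similarity("kitten", "sitting")               # partially similar strings
lev_similarity("abc", "xyz")                      # every character differs
```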

Usage

simi_ptn_pair(df, partition_by = "first_two_letters_first_author_last_name")

Arguments

df

A data frame with bibliographic information that has gone through text normalization. df must have the following columns c("title_norm", "abstract_norm").

partition_by

Quoted name of the column by which to partition the rows. Defaults to "first_two_letters_first_author_last_name". Set to FALSE to skip partitioning, in which case all records are compared against all others.

Besides the default, "year" is another popular partitioning column. We recommend the default if the papers in your dataset are not evenly distributed across years. For instance, if most papers are recent, the default will be much more efficient than "year". Additionally, with the prevalence of preprints, partitioning by "year" becomes less accurate, since a preprint and its published version can carry different years.

You can also construct a custom partition column and pass its name to partition_by.
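As a minimal sketch of a custom partition column, the example below groups records by the first letter of the normalized title. The toy data and the column name title_first_letter are made up for illustration; any column whose values place likely duplicates in the same group could serve.

```r
# Toy data; in practice df would be the output of text normalization.
df <- data.frame(
  title_norm    = c("deep learning for images",
                    "deep learning for image",
                    "graph theory basics"),
  abstract_norm = c("we study images",
                    "we study image data",
                    "an introduction to graphs"),
  stringsAsFactors = FALSE
)

# Hypothetical custom partition column: first letter of the normalized title.
df$title_first_letter <- substr(df$title_norm, 1, 1)

# Inspect the group sizes before running the comparison.
table(df$title_first_letter)

# The column name would then be passed as partition_by, for example:
# simi_ptn_pair(df, partition_by = "title_first_letter")
```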

Value

Two lists of data frames: (1) a list of data frames containing the partitioned df; (2) a list of data frames with string similarity results for "title_norm" and "abstract_norm".

Details

An artificial code "00" is assigned to cells with missing values in the partition_by column, and these rows are partitioned into one group. If you build your own partitioning column, avoid using "00" as a real value, because such rows would be grouped together with the rows that have missing values.
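The grouping behavior described above can be pictured with base R. This is only a sketch of the idea, not the package's internal code; the toy data frame and the column name ptn are made up.

```r
# Toy data with missing partition values (column name `ptn` is hypothetical).
toy <- data.frame(
  id  = 1:4,
  ptn = c("sm", NA, "sm", NA),
  stringsAsFactors = FALSE
)

# Mimic the documented behavior: missing values get the artificial code "00".
toy$ptn[is.na(toy$ptn)] <- "00"

# Rows sharing a code end up in the same group.
groups <- split(toy, toy$ptn)
names(groups)
```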

Estimated computing time, based on past experience: about 23 min for a data frame with 3832 rows on a MacBook Pro (base-model Apple M1 Pro chip, 16 GB memory). Consider running the function on a high-performance computing cluster if you want to shorten the time.
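Because records are compared pairwise, the amount of work is expected to scale with the number of pairs, which is n(n - 1)/2 within a group of n records. The sketch below counts pairs with and without partitioning. The group sizes are invented for illustration and do not describe the benchmark above.

```r
# Number of pairwise comparisons among n records.
n_pairs <- function(n) n * (n - 1) / 2

# No partitioning: every record is compared against every other.
n_pairs(3832)

# Hypothetical partitioning into four groups of uneven size (sums to 3832).
group_sizes <- c(2000, 1000, 500, 332)
sum(n_pairs(group_sizes))
```

The more evenly the records spread across groups, the fewer pairs remain, which is why a partition column dominated by one large group saves little time.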

Examples

if (FALSE) {
# load example dataset
data(bib_example_small)

# text normalization of the data frame
df <- norm_df(bib_example_small)

# calculate similarity; %<-% is the unpacking operator from the zeallot package
library(zeallot)
c(ls_df, ls_df_simi) %<-% simi_ptn_pair(df, partition_by = "first_two_letters_first_author_last_name")
}