Calculate pairwise string similarity
simi_ptn_pair.Rd
The function calculates pairwise similarity, based on Levenshtein edit distance, for the columns "title_norm" and "abstract_norm" between records within the same group after partitioning. Similarity ranges over [0, 1]: Similarity == 1 means the two strings are 100% identical, while Similarity == 0 means they are completely different.
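As a rough illustration of how an edit distance can be mapped onto [0, 1], the sketch below uses base R's `utils::adist()` and divides by the length of the longer string. The helper name `lev_similarity` and this particular normalization are assumptions made for illustration; the function documented here may normalize differently.

```r
# Sketch only: normalized Levenshtein similarity using base R.
# Assumption: similarity = 1 - distance / length of the longer string.
lev_similarity <- function(a, b) {
  d <- drop(utils::adist(a, b))        # Levenshtein edit distance
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)          # two empty strings count as identical
  1 - d / longest
}

lev_similarity("deep learning", "deep learning")  # identical strings
lev_similarity("abc", "xyz")                       # every character differs
lev_similarity("kitten", "sitting")                # partially similar
```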
Arguments
- df
A data frame with bibliographic information that has gone through text normalization.
df must have the following columns: c("title_norm", "abstract_norm").
- partition_by
Quoted name of the column by which to partition the rows. Defaults to "first_two_letters_first_author_last_name". Can be FALSE if you prefer not to partition, in which case all records are compared against all others.
Besides the default, "year" is another popular partitioning parameter. We recommend the default method if the papers in your dataset are not evenly distributed across years. For instance, if most papers are recent, the default method will be much more efficient than "year". Additionally, with the prevalence of preprints, partitioning by "year" becomes less accurate.
You can also construct a custom partition column and pass its name here.
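A custom partition column can be built with ordinary data frame operations. The sketch below is a minimal example on toy data: the `first_author_last_name` column, the `partition` column, and the choice of the first letter as the key are all hypothetical. The `split()` call only illustrates what grouping by that key looks like; it is not the function's internal code.

```r
# Toy data; the first_author_last_name column is hypothetical.
df <- data.frame(
  first_author_last_name = c("Smith", "Smyth", "Jones"),
  title_norm    = c("a study of x", "a study of x", "another topic"),
  abstract_norm = c("we study x", "we study x in depth", "something else"),
  stringsAsFactors = FALSE
)

# Custom partition key: first letter of the first author's last name.
df$partition <- substr(df$first_author_last_name, 1, 1)

# Records would only be compared with others sharing the same key.
groups <- split(df, df$partition)
names(groups)

# The custom column name would then be passed as partition_by, e.g.:
# simi_ptn_pair(df, partition_by = "partition")
```

A coarser key produces fewer, larger groups (more comparisons, fewer missed duplicates), while a finer key does the opposite.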
Value
Two lists of data frames: (1) a list of data frames containing the partitioned df; (2) a list of data frames with string similarity results for "title_norm" and "abstract_norm".
Details
An artificial code "00" is assigned to cells with missing values in the partition_by column, and these rows are partitioned into one group. If you construct your own partitioning column, avoid using "00" as a real value in it, so that it does not collide with this artificial code.
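The effect of the artificial code can be pictured with a few lines of base R. This is a sketch of the idea described above, not the function's actual implementation; the vector `key` is a made-up partition column.

```r
# Hypothetical partition keys with missing values.
key <- c("SM", NA, "JO", NA)

# Missing keys receive the artificial code "00" ...
key[is.na(key)] <- "00"

# ... so all rows with a missing key end up in the same group.
row_groups <- split(seq_along(key), key)
row_groups[["00"]]
```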
Estimated computing time, based on past experience: about 23 min for a data frame with 3832 rows on a MacBook Pro (Apple M1 Pro chip, base model, 16 GB memory). Consider running the function on a high-performance computing cluster if you want to shorten the time.
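The running time is driven by the number of record pairs, which grows quadratically with group size. The sketch below counts pairs with and without partitioning; the group sizes are invented for illustration, and the timing reported above is not broken down this way in the documentation.

```r
# Number of unordered pairs among n records.
n_pairs <- function(n) choose(n, 2)

# No partitioning: every record is compared against every other.
all_pairs <- n_pairs(3832)

# Hypothetical partition of the same 3832 rows into three groups.
group_sizes <- c(2000, 1000, 832)
partitioned_pairs <- sum(n_pairs(group_sizes))

# Fraction of comparisons that remain after partitioning.
partitioned_pairs / all_pairs
```

Because each pair is scored on both "title_norm" and "abstract_norm", the number of string comparisons is twice the number of pairs.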