Find duplicates by fuzzy matching on string similarity between pairs of records

Usage

dup_find_fuzzy_pairwise(
  ls_df,
  ls_df_simi,
  cutoff_title = 0.7,
  cutoff_abstract = 0.7
)

Arguments

ls_df

A list of data frames containing the partitioned dataset (i.e., output #1 of simi_ptn_pair()).

ls_df_simi

A list of data frames containing the calculated string similarity results (i.e., output #2 of simi_ptn_pair()).

cutoff_title

Numeric: cutoff threshold of string similarity for normalized title. Range: [0, 1]. Defaults to 0.7.

cutoff_abstract

Numeric: cutoff threshold of string similarity for normalized abstract. Range: [0, 1]. Defaults to 0.7.

Value

A data frame listing the record IDs and partition ID of each duplicate pair.

Details

For both cutoffs:

The cutoff thresholds chosen for dup_find_fuzzy_adj() usually work here as well. Alternatively, re-examine the similarity distribution plots produced by plot_simi_dist() and choose sensible values from them.
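
To get a feel for how sensitive the result is to the cutoffs, it can help to count how many pairs would be flagged at a few candidate values. The following is a minimal base R sketch, not the package's implementation: the data frame simi, its column names (id_1, id_2, title_simi, abstract_simi), and the helper flag_pairs() are all hypothetical, and the rule that a pair must reach both cutoffs is an assumption made for illustration.

```r
# Hypothetical similarity table for one partition; the column names are
# illustrative and are not taken from the package.
simi <- data.frame(
  id_1 = c(1, 1, 2, 3),
  id_2 = c(2, 3, 3, 4),
  title_simi = c(0.95, 0.40, 0.72, 0.68),
  abstract_simi = c(0.90, 0.35, 0.65, 0.80)
)

# Assumed rule (not confirmed by the documentation): a pair is flagged
# when BOTH similarities reach their respective cutoff.
flag_pairs <- function(df, cutoff_title = 0.7, cutoff_abstract = 0.7) {
  keep <- df$title_simi >= cutoff_title & df$abstract_simi >= cutoff_abstract
  df[keep, ]
}

# Count flagged pairs over a small grid of candidate cutoffs,
# using the same value for title and abstract.
cutoffs <- c(0.6, 0.7, 0.8)
n_flagged <- sapply(cutoffs, function(k) nrow(flag_pairs(simi, k, k)))
cutoff_summary <- data.frame(cutoff = cutoffs, n_flagged = n_flagged)
print(cutoff_summary)
```

A sharp drop in the number of flagged pairs between two neighbouring cutoffs suggests that borderline pairs sit in that range, which is where the distribution plots from plot_simi_dist() are worth a closer look.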

Examples

if (FALSE) {
  id_dup_pair <- dup_find_fuzzy_pairwise(
    ls_df,
    ls_df_simi,
    cutoff_title = 0.7,
    cutoff_abstract = 0.7
  )
}