Find duplicates by fuzzy match of string similarity between pairwise records
dup_find_fuzzy_pairwise.Rd
Arguments
- ls_df
A list of data frames containing the partitioned dataset (i.e., output #1 of simi_ptn_pair()).
- ls_df_simi
A list of data frames with the calculated string similarity results (i.e., output #2 of simi_ptn_pair()).
- cutoff_title
Numeric: cutoff threshold of string similarity for normalized title. Range: [0, 1]. Defaults to 0.7.
- cutoff_abstract
Numeric: cutoff threshold of string similarity for normalized abstract. Range: [0, 1]. Defaults to 0.7.
Details
For both cutoffs: the cutoff thresholds used in dup_find_fuzzy_adj() are usually applicable here. Alternatively, you can re-examine the similarity distribution plots produced by plot_simi_dist() and choose sensible values.
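To make the role of the two cutoffs concrete, here is a minimal base-R sketch of how a pair of thresholds could be applied to a table of pairwise similarities. It does not call the package and is not its implementation: the data frame `pairs`, its column names, and the rule that both similarities must reach their cutoff are assumptions made for illustration only.

```r
# Hypothetical pairwise similarity table. The column names are illustrative
# and are not taken from the package.
pairs <- data.frame(
  id_1          = c(1, 1, 2, 3),
  id_2          = c(2, 3, 3, 4),
  simi_title    = c(0.95, 0.40, 0.72, 0.81),
  simi_abstract = c(0.90, 0.35, 0.65, 0.88)
)

# Documented defaults for both cutoffs.
cutoff_title    <- 0.7
cutoff_abstract <- 0.7

# Assumption: a pair is flagged only when BOTH similarities reach their cutoff.
is_dup <- pairs$simi_title >= cutoff_title &
  pairs$simi_abstract >= cutoff_abstract

# Keep the record identifiers of the flagged pairs.
flagged <- pairs[is_dup, c("id_1", "id_2")]
print(flagged)
```

Under this assumed rule, the pair (2, 3) is not flagged because its abstract similarity of 0.65 falls below 0.7, even though its title similarity passes. Lowering cutoff_abstract would admit it, which is the kind of trade-off the similarity distribution plots help you judge.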