Find duplicates by fuzzy match of string similarity between adjacent rows

Usage

dup_find_fuzzy_adj(df, df_simi, cutoff_title = 0.7, cutoff_abstract = 0.7)

Arguments

df: An ordered data frame (i.e., output #1 of simi_order_adj()).
df_simi: A data frame with string similarity results calculated from df (i.e., output #2 of simi_order_adj()).
cutoff_title: Numeric: cutoff threshold of string similarity for normalized title. Range: [0, 1]. Defaults to 0.7.
cutoff_abstract: Numeric: cutoff threshold of string similarity for normalized abstract. Range: [0, 1]. Defaults to 0.7.

Value

Two data frames: (1) the input df with a "match" column added; (2) a data frame listing the ids of each duplicate pair.
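As a rough illustration of what cutoff-based matching between adjacent rows can look like, the base R sketch below flags pairs in a toy similarity table. The column names (id, id_next, simi_title, simi_abstract), the helper flag_adjacent(), and the rule that both similarities must reach their cutoffs are assumptions made for this example; they are not taken from the implementation of dup_find_fuzzy_adj().

```r
# Toy similarity table: one row per pair of adjacent records (hypothetical columns).
toy_simi <- data.frame(
  id            = c(1, 2, 3, 4),
  id_next       = c(2, 3, 4, 5),
  simi_title    = c(0.95, 0.40, 0.72, 0.90),
  simi_abstract = c(0.88, 0.35, 0.65, NA)
)

# Hypothetical helper: flag a pair when both similarities reach their cutoffs.
# This AND rule is an assumption, not the documented behavior of the package.
flag_adjacent <- function(simi, cutoff_title = 0.7, cutoff_abstract = 0.7) {
  hit <- simi$simi_title >= cutoff_title & simi$simi_abstract >= cutoff_abstract
  hit[is.na(hit)] <- FALSE  # a missing similarity is treated as "no match" here
  simi[hit, c("id", "id_next")]
}

toy_pairs <- flag_adjacent(toy_simi)
```

With the default cutoffs of 0.7, only the first pair qualifies: the third pair passes on title but not on abstract, and the fourth has no abstract similarity to compare.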

For both cutoffs:

We recommend choosing values suited to your dataset, guided by the similarity distribution plot generated by plot_simi_dist(), rather than relying on the defaults.
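A numeric check can complement the plot: tabulate how many adjacent pairs would be flagged at several candidate cutoffs. The base R sketch below uses a made-up vector of title similarities; simi_title and the candidate values are illustrative assumptions, not package output.

```r
# Made-up title similarities for adjacent row pairs (illustrative data only).
simi_title <- c(0.12, 0.18, 0.25, 0.31, 0.66, 0.71, 0.93, 0.97, 0.99)

# Count the pairs at or above each candidate cutoff.
candidates <- c(0.5, 0.6, 0.7, 0.8, 0.9)
n_flagged <- vapply(candidates, function(k) sum(simi_title >= k), integer(1))

cutoff_table <- data.frame(cutoff = candidates, n_flagged = n_flagged)
```

A range of cutoffs over which n_flagged stays constant suggests a gap in the similarity distribution, so the result is less sensitive to the exact value chosen within that range.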

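Examples

if (FALSE) {
# %<-% is the unpacking operator from the zeallot package; attach zeallot if it is not already available
c(df, id_dup_pair) %<-% dup_find_fuzzy_adj(df, df_simi, cutoff_title = 0.7, cutoff_abstract = 0.7)
}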