Find duplicates by fuzzy match of string similarity between adjacent rows
dup_find_fuzzy_adj.Rd
Flags likely duplicate records by comparing the string similarity of normalized titles and abstracts between adjacent rows of an ordered data frame.
Arguments
- df
An ordered data frame (i.e., output #1 of simi_order_adj()).
- df_simi
A data frame with string similarity results calculated from df (i.e., output #2 of simi_order_adj()).
- cutoff_title
Numeric: cutoff threshold of string similarity for normalized title. Range: [0, 1]. Defaults to 0.7.
- cutoff_abstract
Numeric: cutoff threshold of string similarity for normalized abstract. Range: [0, 1]. Defaults to 0.7.
Value
Two data frames: (1) the input df with a "match" column added; (2) a data frame listing the ids of duplicate pairs.
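To make the two outputs concrete, the base-R sketch below mimics the idea on a toy data set. It is not the package's implementation. The helper names norm_sim() and flag_adjacent(), the column names id, title, and abstract, the output names id_1 and id_2, the use of normalized Levenshtein similarity via utils::adist(), and the rule that a pair is flagged only when both similarities reach their cutoffs are all assumptions made for illustration.

```r
# Hypothetical helper: normalized Levenshtein similarity in [0, 1].
# Assumption: the package may use a different similarity measure.
norm_sim <- function(a, b) {
  longest <- pmax(nchar(a), nchar(b))
  dist <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b,
                 USE.NAMES = FALSE)
  ifelse(longest == 0, 1, 1 - dist / longest)
}

# Hypothetical stand-in for the adjacent-row comparison: row i is compared
# with row i + 1 only, so the data frame must already be ordered.
flag_adjacent <- function(df, cutoff_title = 0.7, cutoff_abstract = 0.7) {
  upper <- seq_len(nrow(df) - 1)
  lower <- upper + 1
  sim_title <- norm_sim(df$title[upper], df$title[lower])
  sim_abstract <- norm_sim(df$abstract[upper], df$abstract[lower])
  # Assumption: a pair counts as a duplicate only if BOTH cutoffs are met.
  hit <- sim_title >= cutoff_title & sim_abstract >= cutoff_abstract
  df$match <- FALSE
  df$match[c(upper[hit], lower[hit])] <- TRUE
  pairs <- data.frame(id_1 = df$id[upper[hit]], id_2 = df$id[lower[hit]])
  list(df = df, pairs = pairs)
}

# Made-up records, already ordered so that near-duplicates sit side by side.
toy <- data.frame(
  id = 1:4,
  title = c("deep learning for cats", "deep learning for cat",
            "graph theory basics", "quantum computing"),
  abstract = c("we study cats", "we study cats.",
               "an intro to graphs", "qubits and gates"),
  stringsAsFactors = FALSE
)

res <- flag_adjacent(toy)
res$df     # input rows plus a logical "match" column
res$pairs  # one row per flagged pair of ids
```

Because only neighbouring rows are compared, the quality of the ordering step matters: two near-identical records that end up far apart in df are never compared in this scheme.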
Details
For both cutoffs, we recommend choosing values suited to your dataset, guided by the similarity distribution plot generated by plot_simi_dist().
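As a rough illustration of what "dataset-aware" can mean in practice, the sketch below counts how many adjacent pairs a few candidate cutoffs would flag and marks the default cutoff on a histogram. The scores in sim_title are made-up values, and base hist() is used as a stand-in because the arguments of plot_simi_dist() are not shown on this page.

```r
# Hypothetical title similarity scores for adjacent rows (made-up values).
sim_title <- c(0.12, 0.18, 0.25, 0.31, 0.22, 0.91, 0.95, 0.15, 0.88, 0.40)

# Count how many adjacent pairs each candidate cutoff would flag.
candidates <- c(0.5, 0.7, 0.9)
n_flagged <- vapply(candidates, function(k) sum(sim_title >= k), integer(1))
data.frame(cutoff = candidates, n_flagged = n_flagged)

# Base-R histogram of the scores with the default cutoff (0.7) marked.
hist(sim_title, breaks = seq(0, 1, by = 0.1),
     main = "Title similarity between adjacent rows", xlab = "similarity")
abline(v = 0.7, lty = 2)
```

If the count barely changes across a range of cutoffs, the scores are well separated and any value in that range is a reasonable choice. If it changes sharply, inspecting the borderline pairs by hand before settling on a value is worthwhile.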