Find duplicates by fuzzy matching on string similarity between adjacent rows

Usage

dup_find_fuzzy_adj(df, df_simi, cutoff_title = 0.7, cutoff_abstract = 0.7)

Arguments

df

An ordered data frame (i.e., output #1 of simi_order_adj()).

df_simi

A data frame with string similarity results calculated from df (i.e., output #2 of simi_order_adj()).

cutoff_title

Numeric: cutoff threshold of string similarity for the normalized title. Range: [0, 1]. Defaults to 0.7.

cutoff_abstract

Numeric: cutoff threshold of string similarity for the normalized abstract. Range: [0, 1]. Defaults to 0.7.

Value

Two data frames: (1) the input df with a "match" column added; (2) a data frame listing the ids of duplicate pairs.

Details

For both cutoffs, we recommend choosing values suited to your dataset, guided by the similarity distribution plot generated by plot_simi_dist().
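As a rough illustration of why the cutoffs should be dataset-aware, the sketch below counts how many adjacent pairs would reach a few candidate cutoffs. It uses base R only and a made-up data frame. The column names simi_title and simi_abstract and the use of >= are assumptions for illustration, not the actual structure of df_simi or the comparison this function applies.

```r
# Hypothetical similarity scores for adjacent row pairs. The column names
# `simi_title` and `simi_abstract` are assumptions, not the package's schema.
toy_simi <- data.frame(
  simi_title    = c(0.98, 0.91, 0.72, 0.55, 0.31, 0.12),
  simi_abstract = c(0.95, 0.64, 0.81, 0.40, 0.22, 0.05)
)

# Count how many pairs reach each candidate cutoff (assuming `>=`).
candidate_cutoffs <- c(0.5, 0.7, 0.9)
n_above <- sapply(candidate_cutoffs, function(cutoff) {
  c(title    = sum(toy_simi$simi_title >= cutoff),
    abstract = sum(toy_simi$simi_abstract >= cutoff))
})
colnames(n_above) <- candidate_cutoffs

# Rows are the two fields, columns are the candidate cutoffs.
n_above
```

A sharp drop in the counts between two neighboring cutoffs suggests the similarity distribution has a gap there, which is the kind of feature to look for in the plot.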

Examples

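if (FALSE) {
# %<-% is the destructuring operator from the zeallot package
library(zeallot)
# Unpack the two returned data frames into df and id_dup_pair
c(df, id_dup_pair) %<-% dup_find_fuzzy_adj(df, df_simi, cutoff_title = 0.7, cutoff_abstract = 0.7)
}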