Find duplicates by exact match and remove them
dedu_exact.Rd
Automatically removes duplicates identified by exact match, without manual review.
When duplicates are removed, the most recent version is retained.
Supports deduplication based on multiple columns, which are processed one at a time.
Arguments
- df
A data frame with bibliographic information that has gone through text normalization. df must have the following 6 columns: c("author", "title", "journal", "abstract", "year", "doi").
- match_by
Quoted name(s) of the column(s) with information to match by (e.g., "doi_norm", "title", "title_norm", or c("doi_norm", "title", "title_norm")). If supplying a character vector with multiple elements, deduplication will be performed in order. For example, if match_by = c("doi_norm", "title", "title_norm"), deduplication will be performed first according to "doi_norm", then "title", and finally "title_norm".
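The ordered, one-column-at-a-time matching described for match_by can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helpers dedupe_one() and dedupe_in_order() and the toy data frame are hypothetical, "most recent" is assumed to mean the largest value in the year column, and rows with a missing or empty key are assumed never to count as duplicates.

```r
# Hypothetical helper (not part of the package): drop exact duplicates on one
# column, keeping the row with the latest year for each duplicated value.
dedupe_one <- function(df, col) {
  # Sort so the most recent year comes first (assumption: "most recent" = max year)
  ord <- order(df$year, decreasing = TRUE, na.last = TRUE)
  df <- df[ord, , drop = FALSE]
  key <- df[[col]]
  # Assumption: rows with a missing or empty key are never treated as duplicates
  is_dup <- duplicated(key) & !is.na(key) & key != ""
  df[!is_dup, , drop = FALSE]
}

# Hypothetical wrapper: apply the helper once per column, in the order given
dedupe_in_order <- function(df, match_by) {
  for (col in match_by) {
    df <- dedupe_one(df, col)
  }
  df
}

# Toy data: rows 1-2 share a DOI, rows 3-4 share a title but have no DOI
toy <- data.frame(
  doi_norm = c("10.1/a", "10.1/a", NA, NA),
  title    = c("Deep learning", "Deep learning.", "Graph theory", "Graph theory"),
  year     = c(2019, 2020, 2018, 2021),
  stringsAsFactors = FALSE
)

# First pass removes the older DOI duplicate, second pass the older title duplicate
result <- dedupe_in_order(toy, match_by = c("doi_norm", "title"))
```

Because each pass works on the output of the previous one, the order of match_by matters: a record removed by the "doi_norm" pass is no longer available to match on "title".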
Examples
# load example dataset
data(bib_example_small)
# text normalization of the data frame
df <- norm_df(bib_example_small)
# dedu_exact() calls %>%, relocate(), and last_col(); attach dplyr first so they can be found
library(dplyr)
# deduplicate according to 3 columns in order (one at a time)
df_new <- dedu_exact(df, match_by = c("doi_norm", "title", "title_norm"))