Find duplicates by exact match and remove them

Automatically removes duplicates identified by exact match, without manual review.
When duplicates are removed, the most recent version of the record is retained.
Deduplication can be based on multiple columns, which are processed one at a time.

Usage

dedu_exact(df, match_by)

Arguments

df

A data frame of bibliographic information that has already gone through text normalization. df must contain the following 6 columns: c("author", "title", "journal", "abstract", "year", "doi").

match_by

Quoted name(s) of the column(s) with information to match by (e.g., "doi_norm", "title", "title_norm", c("doi_norm", "title", "title_norm")).

If a character vector with multiple elements is supplied, deduplication is performed one column at a time, in the order given. For example, if match_by = c("doi_norm", "title", "title_norm"), deduplication is performed first by "doi_norm", then by "title", and finally by "title_norm".
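
The ordered passes can be pictured with a small base R sketch. This is not the package's implementation: the helper `dedup_pass` and the toy data are hypothetical, and the sketch keeps the first row of each group instead of the most recent version.

```r
# Hypothetical helper (not part of the package): drop rows whose value
# in `col` exactly repeats the value of an earlier row.
dedup_pass <- function(df, col) {
  df[!duplicated(df[[col]]), , drop = FALSE]
}

# Toy data, made up for illustration only.
toy <- data.frame(
  doi_norm   = c("10.1/a", "10.1/a", "10.1/b", "10.1/c"),
  title_norm = c("deep learning", "deep learning", "graph theory", "graph theory"),
  stringsAsFactors = FALSE
)

# Apply the passes one column at a time, in the order given.
out <- toy
for (col in c("doi_norm", "title_norm")) {
  out <- dedup_pass(out, col)
}
nrow(out)
```

Each pass works on the output of the previous one, so the order of match_by can change which rows are compared in later passes.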

Value

Deduplicated df.

Details

Records with missing information (i.e., NA) in a match_by column are not modified.
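
One way to picture this behaviour in base R is sketched below. The helper `dedup_keep_na` and the toy data are hypothetical and are not taken from the package. Note that `duplicated()` on its own treats NA values as equal to each other, so the sketch masks them explicitly.

```r
# Hypothetical helper (not part of the package): exact-match pass that
# never flags a row as a duplicate when its key is NA.
dedup_keep_na <- function(df, col) {
  key <- df[[col]]
  is_dup <- duplicated(key) & !is.na(key)
  df[!is_dup, , drop = FALSE]
}

# Toy data, made up for illustration only.
toy <- data.frame(
  doi_norm = c("10.1/a", NA, "10.1/a", NA),
  title    = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

out <- dedup_keep_na(toy, "doi_norm")
out$title
```

Both rows with a missing doi_norm survive the pass, while the repeated "10.1/a" row is dropped.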

Examples

# load example dataset
data(bib_example_small)

# text normalization of the data frame
df <- norm_df(bib_example_small)

# as rendered, this example failed with: could not find function "%>%"
# attaching dplyr (which provides %>%, relocate() and last_col()) should avoid that
library(dplyr)

# deduplicate according to 3 columns in order (one at a time)
df_new <- dedu_exact(df, match_by = c("doi_norm", "title", "title_norm"))