  • Automatically removes duplicates identified by exact match, without manual review.

  • When duplicates are removed, the most recent version is retained.

  • Supports deduplication based on multiple columns (one at a time).
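
The idea of keeping the most recent version among exact matches can be sketched in base R as follows. This is only an illustration, not the package's implementation: the helper name keep_most_recent is hypothetical, and it assumes that "most recent" means the largest value in the year column, which this page does not state. The sketch also ignores missing values.

```r
# Hypothetical helper, not part of the package.
# Among rows sharing the same value in `col`, keep only the row with the
# largest `year` (an assumed meaning of "most recent").
keep_most_recent <- function(df, col) {
  # Sort so that, within each group of equal values, the newest row comes first
  ordered <- df[order(df[[col]], -df$year), , drop = FALSE]
  # duplicated() flags every row after the first one in each group
  ordered[!duplicated(ordered[[col]]), , drop = FALSE]
}

toy <- data.frame(
  title = c("study a", "study b", "study a"),
  year  = c(2019, 2020, 2021),
  stringsAsFactors = FALSE
)

recent <- keep_most_recent(toy, "title")
print(recent)
```

In this toy data, the two "study a" rows collapse into the 2021 record, while "study b" is left as is.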

Usage

dedu_exact(df, match_by)

Arguments

df

A data frame with bibliographic information that has gone through text normalization. df must contain the following six columns: c("author", "title", "journal", "abstract", "year", "doi").

match_by

Quoted name(s) of the column(s) with information to match by (e.g., "doi_norm", "title", "title_norm", c("doi_norm", "title", "title_norm")).

If supplying a character vector with multiple elements, deduplication will be performed in order. For example, if match_by = c("doi_norm", "title", "title_norm"), deduplication will be performed first according to "doi_norm", then "title", and finally "title_norm".
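
The "one column at a time" behaviour can be pictured as a loop over match_by, where each pass works on the output of the previous one. The sketch below is a simplified base R illustration, not the code behind dedu_exact: the helper dedup_in_order and the toy data are hypothetical, and for brevity it keeps the first row of each group rather than the most recent one.

```r
# Hypothetical helper, not part of the package.
# Drops exact duplicates one column at a time, in the order given by match_by.
dedup_in_order <- function(df, match_by) {
  for (col in match_by) {
    # Each pass only sees the rows that survived the earlier passes
    df <- df[!duplicated(df[[col]]), , drop = FALSE]
  }
  df
}

toy <- data.frame(
  doi_norm = c("10.1/a", "10.1/a", "10.1/b", "10.1/c"),
  title    = c("Study A", "Study A (reprint)", "Study B", "Study B"),
  stringsAsFactors = FALSE
)

deduplicated <- dedup_in_order(toy, c("doi_norm", "title"))
print(deduplicated)
```

Here the first pass removes the second row (same doi_norm as the first), and the second pass removes the last row (same title as the third), so a duplicate missed by one column can still be caught by a later one.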

Value

Deduplicated df.

Details

Records with a missing value (i.e., NA) in a match_by column won't be modified.
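
One way to read this is that two records are never treated as duplicates merely because both have NA in the matching column. A minimal base R sketch of that rule is shown below; it is an assumption about the behaviour, not taken from the package source, and the vector values is made up for illustration.

```r
# Illustrative values for a single matching column; NA marks missing information
values <- c("10.1/a", NA, "10.1/a", NA)

# duplicated() alone would flag the second NA as a repeat of the first,
# so missing values are explicitly excluded from the duplicate flag
is_dup <- !is.na(values) & duplicated(values)

kept <- values[!is_dup]
print(kept)
```

Only the repeated "10.1/a" is flagged; both NA entries remain.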

Examples

# load example dataset
data(bib_example_small)

# text normalization of the data frame
df <- norm_df(bib_example_small)

# attach dplyr so that the pipe operator %>% used inside dedu_exact() is found
# (without it, the call fails with: could not find function "%>%")
library(dplyr)

# deduplicate according to 3 columns in order (one at a time)
df_new <- dedu_exact(df, match_by = c("doi_norm", "title", "title_norm"))