
Find duplicates by exact match

Usage

dup_find_exact(
  df,
  match_by,
  double_check = FALSE,
  check_by = "first_author_last_name_norm"
)

Arguments

df

A data frame of bibliographic records that has already gone through text normalization. df must contain the following six columns: c("author", "title", "journal", "abstract", "year", "doi").
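A pre-flight check for the required columns can be sketched in base R. The toy data frame and the names `required_cols` and `missing_cols` are illustrative assumptions and are not part of the package.

```r
# Columns the documentation says df must contain.
required_cols <- c("author", "title", "journal", "abstract", "year", "doi")

# Toy input that is deliberately missing "abstract" (illustrative data only).
toy <- data.frame(
  author  = "Smith, J.",
  title   = "An example title",
  journal = "Example Journal",
  year    = 2020,
  doi     = "10.1000/example",
  stringsAsFactors = FALSE
)

# setdiff() returns the required names that are absent from the data frame.
missing_cols <- setdiff(required_cols, names(toy))
if (length(missing_cols) > 0) {
  message("Missing required columns: ", paste(missing_cols, collapse = ", "))
}
```

Running a check like this before calling the function makes a missing column visible up front.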

match_by

Quoted name of the column with information to match by (e.g., "doi_norm", "title", "title_norm").

double_check

Logical: Is confirmation against another column needed? Defaults to FALSE.

check_by

Quoted name of the column with information to double check by. Not required/ignored if double_check == FALSE. Defaults to "first_author_last_name_norm" if double_check == TRUE.

Value

  • If double_check == FALSE, return the input df with a new column named "match".

  • If double_check == TRUE, return 2 data frames (the input df and df_manual_check). The %<-% operator must be used in this case so that both data frames are assigned.
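To make the two-value return concrete, here is a minimal base R sketch of what unpacking two data frames amounts to. `fake_result()` is a hypothetical stand-in, and the assumption that the real function hands back a list of two data frames is not confirmed by this page.

```r
# Hypothetical stand-in for a call with double_check = TRUE. It is ASSUMED
# (not confirmed by the documentation) to return a list of two data frames.
fake_result <- function() {
  list(
    data.frame(id = 1:3),  # plays the role of the returned df
    data.frame(id = 1:2)   # plays the role of df_manual_check
  )
}

res <- fake_result()

# Base R unpacking, element by element.
df_out          <- res[[1]]
df_manual_check <- res[[2]]

# With zeallot attached, the same assignment is a single line:
# c(df_out, df_manual_check) %<-% fake_result()
```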

Details

  • Records with missing information (i.e., NA) in match_by column won't be modified.

  • Double check criteria: Within each identified duplicate set, if all records share the same check_by value, they remain marked as duplicates without review; if the check_by values differ, the duplicate set is output in df_manual_check for manual review.
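The two rules above can be illustrated with a small base R sketch. This is not the package's implementation: the toy data and the names `in_dup_set`, `review_keys`, `manual_check` and `auto_dups` are assumptions chosen for illustration.

```r
# Toy, already-normalized records (illustrative data only).
toy <- data.frame(
  id = 1:6,
  title_norm = c("deep learning", "deep learning", "graph theory",
                 "graph theory", NA, "unique title"),
  first_author_last_name_norm = c("lee", "lee", "kim", "park", "cho", "ng"),
  stringsAsFactors = FALSE
)

key <- toy$title_norm

# A record belongs to a duplicate set when its non-missing key occurs more
# than once. Records with NA in the key are left out of matching entirely.
key_counts <- table(key)  # table() drops NA by default
in_dup_set <- !is.na(key) & key %in% names(key_counts)[key_counts > 1]

# Within each duplicate set, count the distinct check_by values.
n_check_values <- tapply(
  toy$first_author_last_name_norm[in_dup_set],
  key[in_dup_set],
  function(x) length(unique(x))
)

# Sets whose check_by values disagree go to manual review;
# sets that agree stay as duplicates without review.
review_keys  <- names(n_check_values)[n_check_values > 1]
manual_check <- toy[in_dup_set & key %in% review_keys, ]
auto_dups    <- toy[in_dup_set & !(key %in% review_keys), ]
```

In this toy data, the two "graph theory" records have different first authors and land in `manual_check`, while the two "deep learning" records agree and stay in `auto_dups`.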

Examples

library(zeallot)  # to use `%<-%`
library(dplyr)    # provides `%>%`, relocate() and last_col(), which dup_find_exact() calls internally

# load example dataset
data(bib_example_small)

# text normalization of the data frame
df <- norm_df(bib_example_small)

# example 1: match on the normalized DOI, no double check
df_1 <- dup_find_exact(df, match_by = "doi_norm", double_check = FALSE)
# Alternatively, %<-% will also work
df_2 %<-% dup_find_exact(df, match_by = "doi_norm", double_check = FALSE)

# example 2: match on the normalized title, double check by first author's last name
# %<-% must be used here so that both data frames are assigned
c(df, df_manual_check) %<-% dup_find_exact(
  df,
  match_by = "title_norm",
  double_check = TRUE,
  check_by = "first_author_last_name_norm"
)