Find duplicates by exact match
Usage
dup_find_exact(
  df,
  match_by,
  double_check = FALSE,
  check_by = "first_author_last_name_norm"
)
Arguments
- df
A data frame with bibliographic information that has gone through text normalization. df must have the following 6 columns: c("author", "title", "journal", "abstract", "year", "doi").
- match_by
Quoted name of the column with information to match by (e.g., "doi_norm", "title", "title_norm").
- double_check
Logical: Is confirmation against another column needed? Defaults to FALSE.
- check_by
Quoted name of the column with information to double check by. Not required (and ignored) if double_check == FALSE. Defaults to "first_author_last_name_norm" if double_check == TRUE.
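The column requirement on df can be checked before calling the function. The following is a minimal sketch in base R, not part of the package; the toy data frame and the variable names required_cols and missing_cols are hypothetical.

```r
# The six columns that df is documented to require.
required_cols <- c("author", "title", "journal", "abstract", "year", "doi")

# Hypothetical toy input that lacks two of the required columns.
toy_df <- data.frame(
  author  = "Smith, J.",
  title   = "An example title",
  journal = "Example Journal",
  year    = 2020,
  stringsAsFactors = FALSE
)

# Columns that are required but absent from the input.
missing_cols <- setdiff(required_cols, names(toy_df))

if (length(missing_cols) > 0) {
  message("Missing required columns: ", paste(missing_cols, collapse = ", "))
}
```

A check like this surfaces a missing column by name before any matching is attempted.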
Value
If double_check == FALSE, returns the input df with a new column named "match".
If double_check == TRUE, returns 2 data frames (the input df and df_manual_check). The %<-% syntax must be used in this case to have the function return 2 data frames.
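As a rough illustration of how %<-% from the zeallot package assigns two results at once, the sketch below uses a hypothetical stand-in function, split_for_review(), that returns two data frames in a list. It assumes nothing about how dup_find_exact() is implemented internally.

```r
library(zeallot)  # provides %<-%

# Hypothetical stand-in for a function that returns two data frames in a list.
split_for_review <- function(df) {
  list(df, df[df$needs_review, , drop = FALSE])
}

toy <- data.frame(id = 1:3, needs_review = c(FALSE, TRUE, FALSE))

# Unpack the two list elements into two separate variables.
c(flagged, manual) %<-% split_for_review(toy)

# With a single name on the left, %<-% assigns the whole value, like <-.
whole %<-% split_for_review(toy)
```

Here flagged receives the first data frame and manual the second, while whole holds the complete two-element list.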
Details
Records with missing information (i.e., NA) in the match_by column won't be modified.
Double check criteria: Within each identified duplicate set, if check_by is the same, the records remain as duplicates without review; if check_by is different, the duplicate set is output in df_manual_check for manual review.
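The double check criteria can be sketched in base R as follows. This illustrates the rule only and is not the package's implementation; the toy data and the names records, dup_sets, needs_review, auto_duplicates, and manual_check are hypothetical.

```r
# Hypothetical toy data; the column names mirror those used in this help page.
records <- data.frame(
  title_norm = c("deep learning", "deep learning",
                 "graph theory", "graph theory", NA),
  first_author_last_name_norm = c("smith", "smith", "lee", "wong", "chan"),
  stringsAsFactors = FALSE
)

# Group row indices by the match_by value; split() drops rows whose key is NA.
sets <- split(seq_len(nrow(records)), records$title_norm)

# Duplicate sets are groups with more than one record.
dup_sets <- Filter(function(idx) length(idx) > 1, sets)

# A set needs manual review when its check_by values are not all identical.
needs_review <- vapply(
  dup_sets,
  function(idx) length(unique(records$first_author_last_name_norm[idx])) > 1,
  logical(1)
)

# Sets with a consistent check_by value stay duplicates without review.
auto_duplicates <- records[unlist(dup_sets[!needs_review], use.names = FALSE), ]

# Sets with differing check_by values are set aside for manual review.
manual_check <- records[unlist(dup_sets[needs_review], use.names = FALSE), ]
```

In this toy data the two "deep learning" records share an author and stay duplicates, the two "graph theory" records differ by author and go to manual review, and the record with a missing title is left out of both groups.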
Examples
library(zeallot) # to use `%<-%`
# Assumption: attaching dplyr supplies `%>%`, relocate(), and last_col(),
# which the function's internals could not find when these examples were rendered.
library(dplyr)
# load example dataset
data(bib_example_small)
# text normalization of the data frame
df <- norm_df(bib_example_small)
# example 1: match on the normalized DOI, without a double check
df_1 <- dup_find_exact(df, match_by = "doi_norm", double_check = FALSE)
# Alternatively, %<-% will also work
df_2 %<-% dup_find_exact(df, match_by = "doi_norm", double_check = FALSE)
# example 2: match on the normalized title, double check by first author's last name
# Syntax %<-% must be used in this case to have the function return 2 data frames.
c(df, df_manual_check) %<-% dup_find_exact(
  df,
  match_by = "title_norm",
  double_check = TRUE,
  check_by = "first_author_last_name_norm"
)