Find duplicates by exact match
dup_find_exact.Rd
Usage
dup_find_exact(
df,
match_by,
double_check = FALSE,
check_by = "first_author_last_name_norm"
)
Arguments
- df
A data frame of bibliographic records that has already gone through text normalization. df must have the following 6 columns: c("author", "title", "journal", "abstract", "year", "doi").
- match_by
Quoted name of the column to match by (e.g., "doi_norm", "title", "title_norm").
- double_check
Logical: is confirmation against another column needed? Defaults to FALSE.
- check_by
Quoted name of the column to double check by. Not required (and ignored) if double_check == FALSE. Defaults to "first_author_last_name_norm" if double_check == TRUE.
Value
If double_check == FALSE, returns the input df with a new column named "match".
If double_check == TRUE, returns 2 data frames (the input df and df_manual_check). The %<-% syntax must be used in this case so that the function returns both data frames.
Details
Records with missing information (i.e., NA) in the match_by column are not modified.
Double check criteria: within each identified duplicate set, if check_by is the same, the records remain duplicates without review; if check_by is different, the duplicate set is output in df_manual_check for manual review.
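The criteria above can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy data, the `records` and `id` names, and the use of an integer set label in the "match" column are all assumptions made for illustration only.

```r
# Hypothetical toy data; only the column names follow the documentation above.
records <- data.frame(
  id = 1:6,
  title_norm = c("deep learning", "deep learning", "soil carbon",
                 "soil carbon", NA, "unique title"),
  first_author_last_name_norm = c("smith", "smith", "lee",
                                  "wong", "patel", "garcia"),
  stringsAsFactors = FALSE
)

match_by <- "title_norm"
check_by <- "first_author_last_name_norm"
key <- records[[match_by]]

# A record is an exact duplicate if its key is not NA and occurs more than once.
is_dup <- !is.na(key) & (duplicated(key) | duplicated(key, fromLast = TRUE))

# Assumption: label each duplicate set with a shared integer.
# Records with NA keys and unique records keep NA in "match".
records$match <- NA_integer_
records$match[is_dup] <- match(key[is_dup], unique(key[is_dup]))

# Double check: count distinct check_by values within each duplicate set.
n_distinct_check <- tapply(
  records[[check_by]][is_dup],
  records$match[is_dup],
  function(x) length(unique(x))
)

# Sets whose check_by values disagree are set aside for manual review.
needs_review <- as.integer(names(n_distinct_check)[n_distinct_check > 1])
df_manual_check <- records[records$match %in% needs_review, ]
```

In this toy data, the two "deep learning" records share a first author and stay duplicates without review, while the two "soil carbon" records have different first authors and end up in df_manual_check. The record with an NA title is left untouched.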
Examples
library(zeallot) # to use `%<-%`
# Assumption: attaching dplyr makes `%>%`, relocate() and last_col() visible.
# As rendered, the calls below failed with: could not find function "%>%"
library(dplyr)
# load example dataset
data(bib_example_small)
# text normalization of the data frame
df <- norm_df(bib_example_small)
# example 1: match by DOI without double checking
df_1 <- dup_find_exact(df, match_by = "doi_norm", double_check = FALSE)
# Alternatively, %<-% will also work
df_2 %<-% dup_find_exact(df, match_by = "doi_norm", double_check = FALSE)
# example 2: match by title, then double check by first author's last name
# Syntax %<-% must be used in this case to have the function return 2 data frames.
c(df, df_manual_check) %<-% dup_find_exact(
  df,
  match_by = "title_norm",
  double_check = TRUE,
  check_by = "first_author_last_name_norm"
)