RefDeduR tutorial
A text-normalization and decision-tree aided R package enabling accurate and high-throughput reference deduplication for large datasets
Jiaxian Shen, Fangqiong Ling, Erica M. Hartmann
✏️ If you use RefDeduR, please cite: https://www.biorxiv.org/content/10.1101/2022.09.29.510210v1
❓ If you run into issues with the package, please open an issue at https://github.com/jxshen311/RefDeduR or email jiaxianshen2022@u.northwestern.edu.
Introduction
As the scientific literature grows exponentially and research becomes increasingly interdisciplinary, accurate and high-throughput reference deduplication is vital in evidence synthesis studies (e.g., systematic reviews, meta-analyses) to ensure the completeness of datasets while reducing the manual screening burden. To address these emerging needs, we developed RefDeduR. We modularize the deduplication pipeline into finely tuned text normalization, three-step exact matching, and two-step fuzzy matching processes. The package features a decision-tree algorithm and considers preprints and conference proceedings when they co-exist with a peer-reviewed version.
Below, we demonstrate the functionality of RefDeduR with an example pipeline.
Example dataset
We use an example dataset to demonstrate the recommended pipeline of RefDeduR. The dataset contains all bibliographic records (n = 6384) retrieved in a systematic review on indoor surface microbiome studies. The systematic search was conducted on 2022-01-10 through 3 platforms (i.e., PubMed, Web of Science, and Scopus).
Pre-processing: transliterate non-ASCII characters
The transliteration process includes 2 parts: (1) transliterate common Greek letters to their names (e.g., α to alpha, β to beta) and (2) transliterate accented characters to ASCII characters (e.g., á to a, ä to a).
Rationale: This increases the chance of successful deduplication by exact matching. It also reduces noise when partitioning the dataset by the first 2 letters of first_author_last_name_norm at the fuzzy matching step. For example, a record titled “Carriage and population genetics of extended spectrum β-lactamase-producing Escherichia coli in cats and dogs in New Zealand” sometimes appears with the title “Carriage and population genetics of extended spectrum beta-lactamase-producing Escherichia coli in cats and dogs in New Zealand”. Similarly, the author names “Álvarez-Fraga, L. and Pérez, A.” sometimes appear as “Alvarez-Fraga, L. and Perez, A.”.
# Get the path to the example dataset
input_file <- system.file("extdata", "dataset_raw.bib", package = "RefDeduR")
# Specify the path to the output file. Here we put it in the same directory as the input, but you can change the path to wherever you want to store the output file.
# Note: system.file() returns "" for a file that does not exist yet, so we build the path with file.path() instead.
transliterated_file <- file.path(dirname(input_file), "dataset_transliterated.bib")
norm_transliteration(input_file, transliterated_file, method = c("greek_letter-name", "any-ascii"))
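To make the two transliteration parts concrete, here is a minimal base-R sketch of the idea. It is not the implementation of norm_transliteration(); the helper name transliterate_sketch and the tiny lookup tables (three Greek letters, four accented characters) are assumptions made for illustration only.

```r
# Illustrative sketch only; not RefDeduR's implementation.
# Hypothetical lookup tables covering just a few characters.
greek      <- c("\u03b1", "\u03b2", "\u03b3")   # α, β, γ
greek_name <- c("alpha", "beta", "gamma")
accented   <- "\u00e1\u00e4\u00e9\u00c1"        # á, ä, é, Á
unaccented <- "aaeA"

transliterate_sketch <- function(x) {
  # Part 1: replace each Greek letter with its name
  for (i in seq_along(greek)) {
    x <- gsub(greek[i], greek_name[i], x, fixed = TRUE)
  }
  # Part 2: map accented characters to their ASCII counterparts
  chartr(accented, unaccented, x)
}

title_out  <- transliterate_sketch("extended spectrum \u03b2-lactamase-producing Escherichia coli")
author_out <- transliterate_sketch("\u00c1lvarez-Fraga, L. and P\u00e9rez, A.")
print(title_out)
print(author_out)
```

After this step, the two title variants and the two author-name variants from the rationale above become identical strings, which is what allows exact matching to catch them.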
⚒️ Alternatively, Python scripts built on the unidecode package are provided. The scripts are ready to be run in a terminal. The R function and the Python scripts generally perform similarly, with only minor differences arising from the differences between the R package stringi and the Python package unidecode.
# Transliterate common Greek letters to their names
python transliteration_greek_to_name.py <path/to/input_file> <path/to/output_file>
# Transliterate accented characters to ASCII characters
python transliteration_unaccent.py <path/to/input_file> <path/to/output_file>
Read the bibliographic file into a data frame
We use the function revtools::read_bibliography() to read the transliterated BibTeX file into a data frame.
We recommend using BibTeX files here. In our experience, reading .ris files seems to result in formatting errors.
Alternative function: synthesisr::read_refs()
🗒️ Comparison of the two import functions: synthesisr::read_refs() seems better at parsing special characters. For example, it retains “β-α-β” as written, whereas revtools::read_bibliography() may not preserve such characters. However, a potential benefit of revtools::read_bibliography() is that it keeps the citation key (e.g., “RN13774” in the first row of record “@article{RN13774,”) in a column named “label”. If the .bib file is exported from Endnote (the case of this example dataset), this citation key can serve as a unique identifier. This information is also preserved in Covidence export. Covidence is an online systematic review management platform, which is a typical downstream step following reference deduplication. If preserving the citation key (or a unique identifier) across processes is desired, consider switching to revtools::read_bibliography() or importing twice with both functions and combining the data frames.
Here we use revtools::read_bibliography() because we have transliterated the Greek letters and we want the unique identifier.
# Read the transliterated BibTeX file into a data frame
b <- revtools::read_bibliography(transliterated_file) # 6384 rows
# We can check the number of missing values in each column.
# Pay attention to `title` column as we expect all records to have titles.
# If your dataset has only a few NAs in title, maybe it is worth resolving the missing values manually. If your dataset has a substantial number of NAs in title (according to our experience, this is extremely rare), consider sub-setting the dataset and deduplicating separately.
colSums(is.na(b))
Text cleaning and normalization
Before deduplication, we first apply multiple finely tuned text cleaning and normalization operations to the dataset. Finer text normalization increases the chance of successful deduplication at the exact matching step, where both accuracy and confidence are assured.
This step includes not only standard text normalization such as converting letters to lowercase, but also tailored operations in response to patterns we observed, such as removing the trademark “(TM)” in title, removing English stop words in journal, and removing publisher/citation information in abstract. Additionally, we extract helper columns which we will use downstream. See details in the documentation pages of each norm_ and extract_ function.
b <- norm_df(b)
# This function `norm_df` wraps all (1) text normalization and (2) helper field extraction that are needed.
# By default, expect the function to add 8 more columns compared with the original data frame.
# Alternatively, if you want to customize the normalization operations, refer to its sub-functions by `?norm_df` or hack the source code.
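As a rough illustration of what title normalization involves, the following base-R sketch lowercases a title, drops the “(TM)” marker, and collapses punctuation and whitespace. The function name normalize_title_sketch and the exact set and order of operations are assumptions for this example; norm_df() applies its own, more finely tuned rules.

```r
# Illustrative sketch only; norm_df() and its sub-functions apply different, more detailed rules.
normalize_title_sketch <- function(x) {
  x <- tolower(x)                          # standard normalization: lowercase
  x <- gsub("(tm)", "", x, fixed = TRUE)   # tailored operation: drop trademark marker
  x <- gsub("[[:punct:]]+", " ", x)        # replace punctuation with spaces
  x <- gsub("\\s+", " ", x)                # collapse repeated whitespace
  trimws(x)
}

raw_titles <- c("Indoor  Microbiome(TM): A Review.",
                "indoor microbiome - a review")
norm_titles <- normalize_title_sketch(raw_titles)
print(norm_titles)
```

Both raw titles normalize to the same string, so they would be caught at the exact matching step instead of being deferred to fuzzy matching.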
Deduplicate by exact matching
We suggest first deduplicating by exact matching based on 1) “doi_norm”, 2) “title”, and 3) “title_norm” in order. DOI is decisive (i.e., unique to a publication). Title is also highly selective.
Note that here we assume different research papers wouldn’t have 100% identical titles before text normalization. This assumption should hold in most normal cases as indicated previously (1) (2).
The assumption may not apply to special publication types in studies that focus heavily on clinical therapies. For example, we observed identical titles in the deduplicated dataset of this paper.
First, we deduplicate based on “doi_norm” and “title”.
# We remove the identified duplicates without manual review because this is fairly conservative.
b1 <- dedu_exact(b, match_by = c("doi_norm", "title"))
# The most recent version will be retained at removal.
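The logic of exact matching with "keep the most recent version" can be sketched in base R as follows. This is a simplified illustration under our own assumptions (a toy data frame with a year column, and a hypothetical helper dedu_exact_sketch); it does not reproduce the internals of dedu_exact().

```r
# Illustrative sketch only; dedu_exact() may handle ordering and missing values differently.
toy <- data.frame(
  doi_norm = c("10.1/a", "10.1/a", NA, NA),
  title    = c("A", "A", "B", "C"),
  year     = c(2019, 2021, 2020, 2020),
  stringsAsFactors = FALSE
)

dedu_exact_sketch <- function(df, col) {
  # Sort newest first so that duplicated() keeps the most recent record
  df <- df[order(df$year, decreasing = TRUE), ]
  # Records with a missing key are never treated as duplicates of each other
  keep <- is.na(df[[col]]) | !duplicated(df[[col]])
  df[keep, ]
}

out <- dedu_exact_sketch(toy, "doi_norm")
out <- dedu_exact_sketch(out, "title")
print(out)
```

Note the explicit handling of NA: base R's duplicated() treats missing values as equal to each other, so records without a DOI would otherwise be collapsed into one.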
Then, we deduplicate based on “title_norm”.
To make sure that we don’t delete unique records, we introduce a double_check mechanism here and output the duplicate sets whose records differ in check_by (defaults to "first_author_last_name") to b1_manual_check for manual review.
Usually, the number of duplicate sets requiring review is very small (e.g., in this case, only 1 set needs to be reviewed).
It is worth noting that incorporating the double_check mechanism here is extremely conservative. If double checking is not needed, you can incorporate "title_norm" into dedu_exact().
# The %<-% operator (from the zeallot package) must be used here so that the function can return 2 data frames.
c(b1, b1_manual_check) %<-% dup_find_exact(b1, match_by = "title_norm", double_check = TRUE, check_by = "first_author_last_name_norm")
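The double-check idea can be illustrated with a small base-R sketch: records sharing a normalized title form a duplicate set, and a set is sent to manual review only if its members disagree on the check field. The toy data and variable names below are assumptions for illustration; dup_find_exact() is the function to use in practice.

```r
# Illustrative sketch only; not the implementation of dup_find_exact().
toy <- data.frame(
  title_norm = c("a", "a", "b", "b", "c"),
  first_author_last_name_norm = c("shen", "shen", "ling", "hartmann", "lee"),
  stringsAsFactors = FALSE
)

# Records with the same normalized title receive the same match number
match_id <- match(toy$title_norm, unique(toy$title_norm))

# Size of each duplicate set and number of distinct first authors within it
author <- toy$first_author_last_name_norm
set_size  <- ave(seq_along(match_id), match_id, FUN = length)
n_authors <- ave(seq_along(match_id), match_id,
                 FUN = function(i) length(unique(author[i])))

# Sets with more than one record AND conflicting authors need manual review
needs_check <- set_size > 1 & n_authors > 1
manual_check <- toy[needs_check, ]
print(manual_check)
```

In this toy example, the set with title "a" agrees on the first author and can be removed automatically, whereas the set with title "b" is flagged for review.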
If b1_manual_check is empty, nothing needs to be manually reviewed. Otherwise, you can either (1) review it in a data frame format (e.g., preview in R or write.xlsx()) or (2) call the revtools shiny app to review.
Note: It seems that revtools::screen_duplicates() can only display duplicate pairs correctly (i.e., for a duplicate set with more than 2 records, it displays only 2 of them). So we recommend trying option (1) first.
Command to call the revtools shiny app:
revtools::screen_duplicates(as.data.frame(b1_manual_check))
If the records reviewed are duplicates, we can proceed to removing duplicates. Otherwise, if you find that a record is unique, you can mark it by modifying its match number.
Using match == 3011 and first_author_last_name_norm == "de Oliveira" as an example, run
b1$match[which(b1$match == 3011 & b1$first_author_last_name_norm == "de Oliveira")] <- max(b1$match)+1
Alternatively, we can use the function synthesisr::override_duplicates():
b1$match <- synthesisr::override_duplicates(b1$match, 3011)
Note that this only works for duplicate pairs. If the duplicate set has more than 2 records, this can only mark the final record as unique.
Once we finalize match, we can remove duplicates.
# In this example, no unique record is found in b1_manual_check.
b2 <- b1[!duplicated(b1$match), ]
# The most recent version will be retained at removal.
# So as not to interfere with downstream processes, we remove the "match" column
b2 <- dplyr::select(b2, -match)
🗒️ Although it is not included in this standard pipeline, you can try further performing exact matching based on “abstract_norm”.
Deduplicate by fuzzy matching
After removing all duplicates by high-confidence exact matching, we proceed to fuzzy matching. Fuzzy matching is performed by calculating string similarity based on the Levenshtein edit distance.
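For readers unfamiliar with edit-distance similarity, here is a minimal base-R sketch using utils::adist(). The normalization used below (1 minus the distance divided by the length of the longer string) is one common convention and an assumption on our part; the similarity scores computed by RefDeduR may be scaled differently.

```r
# Illustrative sketch only; the normalization is an assumption, not necessarily RefDeduR's.
lev_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]        # Levenshtein edit distance
  1 - d / max(nchar(a), nchar(b))      # scale to [0, 1]; 1 means identical
}

s_same <- lev_similarity("indoor microbiome", "indoor microbiome")
s_near <- lev_similarity("kitten", "sitting")   # edit distance is 3
print(c(s_same, s_near))
```

Two empty strings would give a division by zero here, so a production version would need to handle missing or empty fields explicitly.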
Two major practical challenges in making the fuzzy-matching deduplication process both accurate and high-throughput are:
(1) How do we choose a sensible cutoff threshold for the similarity score?
(2) How do we accelerate the “manual review” step and reduce the burden of manual screening?
We propose two strategies to address these challenges, respectively.
(1) We examine the similarity distribution plots and use the inflection point of the curve as the sensible cutoff threshold. This is a dataset-aware method, and it allows fine-tuning of the cutoff threshold.
(2) We introduce a decision tree that incorporates multiple fields to semi-automate the “manual review” step. This is especially helpful for large datasets, in which case the number of duplicate sets requiring manual review could be infeasibly high (e.g., revtools output ~1,400 duplicate sets for manual confirmation when treating this example dataset).
To improve computational efficiency, we divide this process into 2 steps: (1) order the records alphabetically according to title_norm and compare only adjacent rows; (2) perform pairwise comparisons between records within the same group, partitioned by the first 2 letters of first_author_last_name_norm.
Part 1: order + adjacent
First, we calculate string similarity between adjacent rows for columns "title_norm" and "abstract_norm".
c(b2, b2_simi) %<-% simi_order_adj(b2, order_by = "title_norm")
# Computing time estimation: ~ 47 sec for this data frame (3837 rows) on a Macbook Pro (Apple M1 Pro chip basic model, memory: 16 GB).
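The "order + adjacent" idea can be sketched in base R as below. After sorting, near-identical titles end up next to each other, so only n - 1 comparisons are needed instead of all pairs. The toy titles and the similarity scaling are assumptions for illustration; simi_order_adj() is the function to use on real data.

```r
# Illustrative sketch only; not the implementation of simi_order_adj().
titles <- c("microbiome of indoor surfaces",
            "antibiotic resistance in hospitals",
            "microbiome of indoor surface")

# Step 1: order the records alphabetically by title
sorted <- sort(titles)

# Step 2: compare each row only with the row that follows it
adj_simi <- mapply(
  function(a, b) 1 - utils::adist(a, b)[1, 1] / max(nchar(a), nchar(b)),
  head(sorted, -1), tail(sorted, -1),
  USE.NAMES = FALSE
)
print(data.frame(row = head(sorted, -1), next_row = tail(sorted, -1), simi = adj_simi))
```

Here the two "microbiome" titles differ by a single character and score well above a 0.7 cutoff, while the unrelated pair scores below it.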
Then we plot similarity distributions of normalized title and abstract to choose cutoffs.
# Distribution of similarity scores based on normalized title
p_b2_ti <- plot_simi_dist(b2_simi, "title_simi")
p_b2_ti # show plot in the Plots tab
# Distribution of similarity scores based on normalized abstract
p_b2_ab <- plot_simi_dist(b2_simi, "abstract_simi")
p_b2_ab # show plot in the Plots tab
The plots suggest a cutoff score of 0.7 or 0.6 for the title and 0.3 for the abstract. For demonstration purposes, we use 0.7 and 0.3 here. The selected cutoffs are then passed to dup_find_fuzzy_adj() to locate potential duplicates. The function outputs 2 data frames: (1) the input data frame b2 with a "match" column added and (2) a data frame listing the id of duplicate pairs (id_dup_pair_adj).
🗒️ Note that the inflection point serves more like a number to begin with. The result should usually be satisfactory, but you may tweak the values to see if the performance can be further improved.
c(b2, id_dup_pair_adj) %<-% dup_find_fuzzy_adj(b2, b2_simi, cutoff_title = 0.7, cutoff_abstract = 0.3)
Per the 2nd strategy, we introduce the decision tree to semi-automate the “manual review” step. Decisions are added to the "decision" column in id_dup_pair. There are 3 possible decisions: “duplicate”, “not duplicate”, and “check”. If the decision is “not duplicate”, the "match" column in df will be modified. To ensure high accuracy, especially a low false positive rate, the output “check” is kept in the decision tree.
c(b2, id_dup_pair_adj) %<-% decision_tree_adj(b2, id_dup_pair_adj)
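To show how a three-level decision can be produced from several fields, here is a toy base-R sketch. The rules below (comparing only first author and year) are invented for illustration and are not the rules used by decision_tree_adj(); the function name decide_pair_sketch is likewise hypothetical.

```r
# Illustrative sketch only; these rules are invented and are NOT RefDeduR's decision tree.
decide_pair_sketch <- function(author_a, author_b, year_a, year_b) {
  # Missing evidence: defer to a human
  if (is.na(author_a) || is.na(author_b)) return("check")
  # Conflicting first authors: treat as distinct records
  if (author_a != author_b) return("not duplicate")
  # Same author and same year: treat as a duplicate
  if (!is.na(year_a) && !is.na(year_b) && year_a == year_b) return("duplicate")
  # Same author but year differs or is missing: ambiguous
  "check"
}

d1 <- decide_pair_sketch("shen", "shen", 2022, 2022)
d2 <- decide_pair_sketch("shen", "ling", 2022, 2022)
d3 <- decide_pair_sketch("shen", "shen", 2021, 2022)
d4 <- decide_pair_sketch(NA, "shen", 2022, 2022)
print(c(d1, d2, d3, d4))
```

The point of keeping a “check” outcome is visible in the last two cases: when the fields neither clearly agree nor clearly conflict, the pair is routed to manual review instead of being removed automatically.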
Once we get the algorithm-generated decisions, we can deduplicate accordingly for different scenarios.
# For the "duplicate", we can just deduplicate by `dup_rm_adj()`.
b2_inter <- dup_rm_adj(b2, id_dup_pair_adj)
# For the "check", we call revtools shiny app to review the duplicate pairs.
# In the app, we select "Yes" for "Is there a variable describing duplicates in this dataset?" and "match" for "Select column containing duplicate data".
# At this step, we finish removing duplicates or keeping record pairs in the app. We click "Not duplicates" if the pair is not duplicated. Or we click "Select Entry #1" or "Select Entry #2" to keep one of the two.
# After reviewing all potential duplicates, don't forget to click "Save Data" and "Exit App" to return the results to the R workspace. In this case, the results will be returned to variable `b3`.
# See revtools tutorial for more instructions: https://revtools.net/deduplication.html
b3 <- revtools::screen_duplicates(b2_inter)
# remove helper columns
b3 <- dplyr::select(b3, -c(id, match, matches))
Part 2: partition + pairwise
We further look for potential duplicates according to pairwise string similarity between all records within the same partitioned group for columns "title_norm" and "abstract_norm".
By default, we partition the dataset by the first 2 letters of first_author_last_name_norm. This is more efficient than another popular partitioning parameter, year, for datasets that are skewed towards recent years. Additionally, with the prevalence of preprints, partitioning by "year" becomes less accurate. Nevertheless, you can customize the partitioning parameter as you prefer.
Following a pipeline similar to that in part 1, we first calculate string similarity. Because we partition the dataset, results are now stored in lists (as compared to data frames in part 1).
c(ls_b3, ls_b3_simi) %<-% simi_ptn_pair(b3, partition_by = "first_two_letters_first_author_last_name")
# Computing time estimation: ~ 23 min for this data frame (3832 rows) on a Macbook Pro (Apple M1 Pro chip basic model, memory: 16 GB). You can consider running it on a high performance computing cluster if shortening the running time is of high priority.
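The benefit of partitioning can be seen in a small base-R sketch that counts how many pairwise comparisons remain after grouping records by the first 2 letters of the first author's last name. The toy author names are assumptions for illustration; simi_ptn_pair() performs the actual partitioning and similarity calculation.

```r
# Illustrative sketch only; not the implementation of simi_ptn_pair().
authors <- c("shen", "sheldon", "ling", "li", "hartmann")

# Partition record indices by the first 2 letters of the last name
partition <- substr(authors, 1, 2)
groups <- split(seq_along(authors), partition)

# Enumerate pairs within each group (groups of size 1 have nothing to compare)
pairs <- lapply(groups, function(idx) {
  if (length(idx) > 1) t(utils::combn(idx, 2)) else NULL
})

n_pairs_partitioned <- sum(vapply(groups, function(idx) choose(length(idx), 2), numeric(1)))
n_pairs_all <- choose(length(authors), 2)
print(c(partitioned = n_pairs_partitioned, exhaustive = n_pairs_all))
```

In this toy case the number of comparisons drops from 10 to 2. The trade-off is that duplicates whose first-author names fall into different groups are not compared at this step.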
Then we flag potential duplicates.
id_dup_pair_pairwise <- dup_find_fuzzy_pairwise(ls_b3, ls_b3_simi, cutoff_title = 0.7, cutoff_abstract = 0.7)
🗒️ The cutoff thresholds can be inherited from part 1. To avoid over-deleting unique records, we suggest tightening the cutoff of abstract similarity to 0.7 (or 0.6) in this step, as opposed to 0.3 in part 1, where the risk is mitigated by the more restricted ordering (in contrast to the exhaustive pairwise comparison).
We then apply the decision tree to potential duplicates.
id_dup_pair_pairwise <- decision_tree_pairwise(ls_b3, id_dup_pair_pairwise)
For the duplicate pairs with a “check” decision, we output them into a data frame for manual review.
df_check_pairwise <- dup_screen_pairwise(ls_b3, id_dup_pair_pairwise)
Similarly, we can either (1) review them directly in a data frame format (e.g., preview in R or write.xlsx()) or (2) call the revtools shiny app for visualization by revtools::screen_duplicates(df_check_pairwise). However, we don’t resolve duplicates directly in the revtools shiny app. Instead, we use dup_resolve_pairwise() to change decisions from “check” to “duplicate” or “not duplicate” according to the manual review results.
# All 4 duplicate pairs in this example dataset are "not duplicate".
id_dup_pair_pairwise <- dup_resolve_pairwise(
id_dup_pair_pairwise,
df_check_pairwise,
match_index = c(1, 2, 3, 4),
result = "not duplicate")
Afterwards, we remove duplicates by dup_rm_pairwise().
b4 <- dup_rm_pairwise(ls_b3, id_dup_pair_pairwise, to_dataframe = TRUE)
# remove helper columns
b4 <- dplyr::select(b4, -id, -partition)
Export the deduplicated dataset to .bib or .ris formats
revtools::write_bibliography(b4, "inst/extdata/dataset_deduplicated.ris", format = "ris") # export .ris
# or export .bib
# revtools::write_bibliography(b4, "inst/extdata/dataset_deduplicated.bib", format = "bib")
Alternative function: synthesisr::write_refs().
In our observation, revtools::write_bibliography() seems to preserve more fields in the exported file, but it’s always a good idea to test both functions on your own dataset.
🗒️ We export a .ris file here because the downstream applications in our pipeline (e.g., we tested Covidence, Rayyan, and Endnote) seem to recognize a .ris file more readily than a .bib file.