RefDeduR tutorial

✏️ If you use RefDeduR, please cite: https://www.biorxiv.org/content/10.1101/2022.09.29.510210v1

❓ If you run into issues with the package, please open an issue at https://github.com/jxshen311/RefDeduR or email jiaxianshen2022@u.northwestern.edu.

library(RefDeduR)

library(dplyr)
library(zeallot)  # to use `%<-%`

Introduction

As the scientific literature grows exponentially and research becomes increasingly interdisciplinary, accurate and high-throughput reference deduplication is vital in evidence synthesis studies (e.g., systematic reviews, meta-analyses) to ensure the completeness of datasets while reduce the manual screening burden. To address these emerging needs, we developed RefDeduR. We modularize the deduplication pipeline into finely-tuned text normalization, three-step exact matching, and two-step fuzzy matching processes. The package features a decision-tree algorithm and considers preprints and conference proceedings when they co-exist with a peer-reviewed version.

Below, we demonstrate the functionality of RefDeduR with an example pipeline.

Example dataset

We use an example dataset to demonstrate the recommended pipeline of RefDeduR. The dataset contains all bibliographic records (n = 6384) retrieved in a systematic review on indoor surface microbiome studies. The systematic search was conducted on 2022-01-10 through 3 platforms (i.e., PubMed, Web of Science, and Scopus).

Pre-processing: transliterate non-ASCII characters

The transliteration process includes 2 parts: (1) transliterate common Greek letters to their names (e.g., α to alpha, β to beta) and (2) transliterate accented characters to ASCII characters (e.g., á to a, ä to a).

Rationale: This increases the chance of successful deduplication by exact matching. This also reduces noises when partitioning the dataset by the first 2 letters of first_author_last_name_norm at the fuzzy matching step. For example, a record titled “Carriage and population genetics of extended spectrum β-lactamase-producing Escherichia coli in cats and dogs in New Zealand” sometimes has the title “Carriage and population genetics of extended spectrum beta-lactamase-producing Escherichia coli in cats and dogs in New Zealand”. Author names “Álvarez-Fraga, L. and Pérez, A.” are sometimes “Alvarez-Fraga, L. and Perez, A.”.

# Get the path to the example dataset
input_file <- system.file("extdata", "dataset_raw.bib", package = "RefDeduR")

# Specify the path to the output file. Here we put it in the same directory but you can modify the path to wherever you want to store the output file.
transliterated_file <- system.file("extdata", "dataset_transliterated.bib", package = "RefDeduR")

norm_transliteration(input_file, transliterated_file, method = c("greek_letter-name", "any-ascii"))

⚒️ Alternatively, python scripts developed on the basis of unidecode package are provided. The scripts are ready to be run in terminal. Performance of the R function and the python scripts are generally similar, with only little difference induced by the difference of R package stringi and python package unidecode.
# Transliterate common Greek letters to their names
python transliteration_greek_to_name.py <path/to/input_file> <path/to/output_file>

# Transliterate accented characters to ASCII characters 
python transliteration_unaccent.py <path/to/input_file> <path/to/output_file>

Read the bibliographic file into a data frame

We use function revtools::read_bibliography() to read the transliterated bibTex file into a data frame.

We recommend using bibTex files here. According to past experience, reading .ris file seems to result in formatting errors.
Alternative function: synthesisr::read_refs()
🗒️ Comparison of the two import functions:

synthesisr::read_refs() seems better at parsing special characters. “β-α-β” can be retained, while the text becomes “Î²-Î±-Î²” when using revtools::read_bibliography(). However, a potential benefit of revtools::read_bibliography() is that it keeps the citation key (e.g., “RN13774” in the first row of record “@article{RN13774,”) in a column named “label”. If the .bib file is exported from Endnote (the case of this example dataset), this citation key can serve as a unique identifier. This information is also preserved in Covidence export. Covidence is an online systematic review management platform, which is a typical downstream step following reference deduplication. If preserving the citation key (or a unique identifier) across processes is desired, consider switching to revtools::read_bibliography() or importing twice with both functions and combine the data frames.

Here we use revtools::read_bibliography() because we have transliterated the Greek letters and we want the unique identifier.

# Read the transliterated bibTex file into a data frame
b <- revtools::read_bibliography(transliterated_file)  # 6384 rows

# We can check the number of missing values in each column.
# Pay attention to `title` column as we expect all records to have titles.
# If your dataset has only a few NAs in title, maybe it is worth resolving the missing values manually. If your dataset has a substantial number of NAs in title (according to our experience, this is extremely rare), consider sub-setting the dataset and deduplicating separately.
colSums(is.na(b))

Text cleaning and normalization

Before deduplication, we first apply multiple finely-tuned text cleaning and normalization to the dataset. A finer text normalization increases the chance of successful deduplication at the exact matching step, where both accuracy and confidence are assured.

This step includes not only standard text normalization such as converting letters to lowercase, but also tailored operations in response to patterns we observed, such as removing trademark “(TM)” in title, removing English stop words in journal, and removing publisher/citation information in abstract. Additionally, we extract helper columns which we will use downstream. See details in each norm_ and extract_ functions’ documentation pages.

b <- norm_df(b)
# This function `norm_df` wraps all (1) text normalization and (2) helper field extraction that are needed.
# By default, expect the function to add 8 more columns compared with the original data frame.
# Alternatively, if you want to customize the normalization operations, refer to its sub-functions by `?norm_df` or hack the source code.

Deduplicate by exact matching

We suggest first deduplicating by exact matching based on 1) “doi_norm”, 2) “title”, and 3) “title_norm” in order. DOI is decisive (i.e., unique to a publication). Title is also highly selective.

Note that here we assume different research papers wouldn’t have 100% identical titles before text normalization. This assumption should hold in most normal cases as indicated previously (1) (2).

The assumption may not apply to special publication types in studies that heavily focus on clinical therapies. For example, we observed that identical titles present in the deduplicated dataset in this paper.

First, we deduplicate based on “doi_norm” and “title”.

# We remove the identified duplicates without manual review because this is fairly conservative.
b1 <- dedu_exact(b, match_by = c("doi_norm", "title"))
# The most recent version will be retained at removal.

Then, we deduplicate based on “title_norm”

To make sure that we don’t delete unique records, we introduce a double_check mechanism here and output the duplicate sets with different check_by (defaults to "first_author_last_name") to b1_manual_check for manual review.

Usually, the number of duplicate sets requires review is very small (e.g., in this case, only 1 set needs to be reviewed).

It is worth noting that incorporating the double_check mechanism here is extremely conservative. If double checking is not needed, you can incorporate "title_norm" into dedu_exact().

c(b1, b1_manual_check) %<-% dup_find_exact(b1, match_by = "title_norm", double_check = TRUE, check_by = "first_author_last_name_norm")  # Syntax %<-% must be used in this case to have the function return 2 data frames.

If b1_manual_check is empty, nothing needs to be manually reviewed. Otherwise, you can either (1) review it in a data frame format (e.g., preview in R or write.xlsx()) or (2) call revtools shiny app to review.

Note: It seems that revtools::screen_duplicates() can only display duplicate pairs correctly (i.e., it only displays 2 records of a duplicate set with more than 2 records). So we recommend trying option (1) first.
Command to call revtools shiny app: revtools::screen_duplicates(as.data.frame(b1_manual_check))

If the records reviewed are duplicates, we can proceed to removing duplicates. Otherwise, if you find a record unique, we can mark them by modifying their match number.

Using match == 3011 and first_author_last_name_norm == "de Oliveira" as an example, run b1$match[which(b1$match == 3011 & b1$first_author_last_name_norm == "de Oliveira")] <- max(b1$match)+1
Alternatively, we can use function synthesisr::override_duplicates(): b1$match <- synthesisr::override_duplicates(b1$match, 3011). Note that this only works for duplicate pairs. If the duplicate set has more than 2 records, this can only mark the final record unique.

Once we finalize match, we can remove duplicates.

# In this example, no unique record is found in b1_manual_check.
b2 <- b1[!duplicated(b1$match), ]
# The most recent version will be retained at removal.

# In order not to interfere downstream processes, we remove the "match" column
b2 <- select(b2, -match)

🗒️ Although it is not included in this standard pipeline, you can try further performing exact matching based on “abstract_norm”.

Deduplicate by fuzzy matching

After we remove all duplicates by the high-confidence exact matching, we now proceed to fuzzy matching. Fuzzy matching is made by calculating string similarity based on Levenshtein edit distance.

Two major practical challenges of making the fuzzy-matching deduplication process both accurate and high-throughput are

How do we choose a sensible cutoff threshold for the similarity score?
How do we accelerate the “manual review” step and reduce burden of manual screening?

We propose two strategies to address these challenges correspondingly.

We examine the similarity distribution plots and use the inflection point of the curve as the sensible cutoff threshold. This is a dataset-aware method, and it allows fine-tuning of the cutoff threshold.
We introduce a decision tree that incorporates multiple fields to semi-automate the “manual review” step. This is especially helpful for large datasets, in which case the number of duplicate sets requiring manual review could be unfeasibly high (e.g., revtools output ~1,400 duplicate sets for manual confirmation when treating this example dataset).

To improve the computational efficiency, we divide this process into 2 steps: (1) order the records alphabetically according to title_norm and compare only between the adjacent rows; (2) perform pairwise comparisons between records within the same group partitioned by the first 2 letters of first_author_last_name_norm.

Part 1: order + adjacent

Firstly, we calculate string similarity between adjacent rows for columns "title_norm" and "abstract_norm".

c(b2, b2_simi) %<-% simi_order_adj(b2, order_by = "title_norm")
# Computing time estimation: ~ 47 sec for this data frame (3837 rows) on a Macbook Pro (Apple M1 Pro chip basic model, memory: 16 GB).

Then we plot similarity distributions of normalized title and abstract to choose cutoffs.

# Distribution of similarity scores based on normalized title
p_b2_ti <- plot_simi_dist(b2_simi, "title_simi")
p_b2_ti  # show plot in the Plots tab

# Distribution of similarity scores based on normalized abstract
p_b2_ab <- plot_simi_dist(b2_simi, "abstract_simi")
p_b2_ab  # show plot in the Plots tab

The plots suggest a cutoff score of 0.7 or 0.6 for the title and 0.3 for the abstract. For demonstration purpose, we use 0.7 and 0.3 here. The selected cutoffs are then passed to dup_find_fuzzy_adj() to locate potential duplicates. The function outputs 2 data frames: (1) the input data frame b2 with "match" column added and (2) a data frame listing id of duplicate pairs (id_dup_pair_adj).

🗒️ Note that the inflection point serves more like a number to begin with. The result should usually be satisfactory, but you may tweak the values to see if the performance can be further improved.

c(b2, id_dup_pair_adj) %<-% dup_find_fuzzy_adj(b2, b2_simi, cutoff_title = 0.7, cutoff_abstract = 0.3)

Per the 2nd strategy, we introduce the decision tree to semi-automate the “manual review” step. Decisions are added to the "decision" column in id_dup_pair. There could be 3 levels of decisions, “duplicate”, “not duplicate”, and “check”. If the decision is “not duplicate”, "match" column in df will be modified. To ensure a high accuracy, especially a low false positive rate, output “check” is kept in the decision tree.

c(b2, id_dup_pair_adj) %<-% decision_tree_adj(b2, id_dup_pair_adj)

Once we get the algorithm-generated decisions, we can deduplicate accordingly for different scenarios.

# For the "duplicate", we can just deduplicate by `dup_rm_adj()`. 
b2_inter <- dup_rm_adj(b2, id_dup_pair_adj)

# For the "check", we call revtools shiny app to review the duplicate pairs.
# In the app, we select "Yes" for "Is there a variable describing duplicates in this dataset?" and "match" for "Select column containing duplicate data".
# At this step, we finish removing duplicates or keeping record pairs in the app. We click "Not duplicates" if the pair is not duplicated. Or we click "Select Entry #1" or "Select Entry #2" to keep one of the two.
# After reviewing all potential duplicates, don't forget to click "Save Data" and "Exit App" to return the results to the R workspace. In this case, the results will be returned to variable `b3`.
# See revtools tutorial for more instructions: https://revtools.net/deduplication.html
b3 <- revtools::screen_duplicates(b2_inter)  


# remove helper columns
b3 <- select(b3, -c(id, match, matches))

Part 2: partition + pairwise

We further look for potential duplicates according to pairwise string similarity between all records within the same partitioned group for columns "title_norm" and "abstract_norm".

By default, we partition the dataset by the first 2 letters of first_author_last_name_norm. This is more efficient than another popular partitioning parameter - year - for datasets that are skewed towards recent years. Additionally, with the prevalence of preprints, partitioning by "year" becomes less accurate. Nevertheless, you can customize the partitioning parameter by preference.

Following a pipeline similar to that in part 1, we first calculate string similarity. Because we partition the dataset, results are now stored in lists (as compared to data frames in part 1).

c(ls_b3, ls_b3_simi) %<-% simi_ptn_pair(b3, partition_by = "first_two_letters_first_author_last_name")
# Computing time estimation: ~ 23 min for this data frame (3832 rows) on a Macbook Pro (Apple M1 Pro chip basic model, memory: 16 GB). You can consider running it on a high performance computing cluster if shortening the running time is of high priority.

Then we flag potential duplicates.

id_dup_pair_pairwise <- dup_find_fuzzy_pairwise(ls_b3, ls_b3_simi, cutoff_title = 0.7, cutoff_abstract = 0.7)

🗒️ The cutoff thresholds can be inherited from part 1. To avoid over-deleting unique records, we suggest tightening the cutoff of abstract similarity to 0.7 (or 0.6) in this step, as opposed to 0.3 in part 1, where the risk is mitigated by the more restricted ordering (in contrast to the exhaustive pairwise comparison).

We then apply the decision tree to potential duplicates.

id_dup_pair_pairwise <- decision_tree_pairwise(ls_b3, id_dup_pair_pairwise)

For the duplicate pairs with a “check” decision, we output them into a data frame for manual review.

df_check_pairwise <- dup_screen_pairwise(ls_b3, id_dup_pair_pairwise)

Similarly, we can either (1) review them directly in a data frame format (e.g., preview in R or write.xlsx()) or (2) call revtools shiny app for visulization by revtools::screen_duplicates(df_check_pairwise). However, we don’t resolve duplicates directly in the revtools shiny app. Instead, we use dup_resolve_pairwise() to change decisions from “check” to “duplicate” or “not duplicate” according to the manual review results.

# All 4 duplicate pairs in this example dataset are "not duplicate".
id_dup_pair_pairwise <- dup_resolve_pairwise(
  id_dup_pair_pairwise,
  df_check_pairwise,
  match_index = c(1, 2, 3, 4),
  result = "not duplicate")

Afterwards, we remove duplicates by dup_rm_pairwise().

b4 <- dup_rm_pairwise(ls_b3, id_dup_pair_pairwise, to_dataframe = TRUE)

# remove helper columns
b4 <- select(b4, -id, -partition)

Export the deduplicated dataset to .bib or .ris formats

revtools::write_bibliography(b4, "inst/extdata/dataset_deduplicated.ris", format = "ris")  # export .ris
# or export .bib
#// revtools::write_bibliography(b4, "inst/extdata/dataset_deduplicated.bib", format = "bib")

Alternative function: synthesisr::write_refs(). According to our observation, revtools::write_bibliography() seems to preserve more fields in the exported file, but it’s always a good idea to test both functions on your own dataset.

🗒️ We export .ris file here because the downstream applications (e.g., we tested Covidence , Rayyan, and Endnote) in our pipeline seem easier to recognize a .ris file than a .bib file.

A text-normalization and decision-tree aided R package enabling accurate and high-throughput reference deduplication for large datasets

Jiaxian Shen, Fangqiong Ling, Erica M. Hartmann