Calculate string similarity between adjacent rows

Computes string similarity, based on Levenshtein edit distance, for the columns "title_norm" and "abstract_norm" between each pair of adjacent rows. Similarity ranges over [0, 1]: a value of 1 means the two strings are identical, and a value of 0 means they are completely different.
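
To make the idea concrete, here is a minimal sketch of a Levenshtein-based similarity between adjacent elements of a character vector, using only base R. The helper name adjacent_similarity and the normalization (1 minus the edit distance divided by the length of the longer string) are assumptions for illustration. The exact formula used by simi_order_adj is not stated here and may differ.

```r
# Hypothetical helper, not part of the package: similarity between each
# string and the string in the next row, based on Levenshtein distance.
adjacent_similarity <- function(x) {
  n <- length(x)
  if (n < 2) return(numeric(0))
  a <- x[-n]  # elements 1 .. n-1
  b <- x[-1]  # elements 2 .. n
  # adist() returns a full distance matrix, so compare one pair at a time
  d <- mapply(function(s, t) as.numeric(utils::adist(s, t)), a, b,
              USE.NAMES = FALSE)
  longest <- pmax(nchar(a), nchar(b))
  # Assumed normalization: 1 - distance / length of the longer string;
  # two empty strings are treated as identical
  ifelse(longest == 0, 1, 1 - d / longest)
}

# Sort first so that similar titles end up next to each other
titles <- sort(c("deep learning", "zebra", "deep learnin", "deep learning"))
adjacent_similarity(titles)
```

Element i of the result compares string i with string i + 1, so the result has one element fewer than the input.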

Usage

simi_order_adj(df, order_by = "title_norm")

Arguments

df: A data frame of bibliographic information that has already gone through text normalization. df must contain the columns "title_norm" and "abstract_norm".
order_by: Quoted name of the column by which to order the rows. Defaults to "title_norm".

Value

Two data frames: (1) df, ordered by order_by; (2) a data frame with the string similarity results for "title_norm" and "abstract_norm". Both data frames have a matching id column.

Details

This function assumes that all records have titles.
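
Because of this assumption, it can be useful to check for missing or empty titles before calling the function. The following is a minimal sketch in base R. The helper has_title and the toy data frame are hypothetical and are not part of the package.

```r
# Hypothetical pre-check, not part of the package: flag records that
# have a usable (non-missing, non-empty) normalized title.
has_title <- function(df) {
  !is.na(df$title_norm) & nzchar(trimws(df$title_norm))
}

# Toy data frame standing in for a normalized bibliographic data frame
toy <- data.frame(
  title_norm = c("a study of x", NA, ""),
  abstract_norm = c("we study x", "no title here", NA),
  stringsAsFactors = FALSE
)

has_title(toy)                   # TRUE FALSE FALSE
toy_ok <- toy[has_title(toy), ]  # keep only records that have a title
```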

Approximate computing time, based on past experience: about 46 seconds for a data frame with 3837 rows on a MacBook Pro (base-model Apple M1 Pro chip, 16 GB memory).

Examples

if (FALSE) {
# load example dataset
data(bib_example_small)

# text normalization of the data frame
df <- norm_df(bib_example_small)

# calculate similarity
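# %<-% is a multi-assignment operator (as provided by the zeallot package);
# it unpacks the two returned data frames into df and df_simi
c(df, df_simi) %<-% simi_order_adj(df, order_by = "title_norm")
# df_simi[1, ] stores the similarity results between df[1, ] and df[2, ]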
}