Clean and normalize abstract in bibliography — norm

Each abstract is normalized with the following string operations:

Remove trailing copyright information such as ". (C) 1998 International Astronautical Federation Published by Elsevier Science Ltd. All rights reserved." and "(C) 2000 Elsevier Science B.V. All rights reserved.", matched by 6 empirically observed patterns, so that this text does not influence later analysis of the abstract.
Convert letters to lowercase.
Remove whitespace from the start and end of the string, and reduce repeated whitespace inside the string.
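
The three steps above can be sketched in base R as follows. This is only an illustration: sketch_norm_abstract is a hypothetical helper, not the package's implementation, and its single regular expression is an assumption standing in for the 6 patterns, which are not listed on this page. Collapsing repeated whitespace to a single space is likewise an assumption.

```r
# Hypothetical helper for illustration; NOT the implementation of norm_abstract().
sketch_norm_abstract <- function(abstract) {
  # Assumed pattern: a trailing "(C) <year> ... All rights reserved." notice,
  # optionally preceded by a period. The real function uses 6 patterns.
  trailing_notice <- "\\.?\\s*\\(C\\)\\s*[0-9]{4}.*All rights reserved\\.?\\s*$"
  out <- sub(trailing_notice, "", abstract, ignore.case = TRUE, perl = TRUE)

  # Convert letters to lowercase.
  out <- tolower(out)

  # Trim both ends, then collapse runs of whitespace to one space (assumed).
  gsub("\\s+", " ", trimws(out))
}

example_abstracts <- c(
  "  We study   orbital debris. (C) 2000 Elsevier Science B.V. All rights reserved.",
  "A Second   Abstract without a notice. "
)
sketch_result <- sketch_norm_abstract(example_abstracts)
print(sketch_result)
# [1] "we study orbital debris"             "a second abstract without a notice."
```

Note that, as in the first example quoted above, the period immediately before the notice is removed together with it.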

Usage

norm_abstract(abstract, first_author_last_name)

Arguments

abstract

A character vector of abstracts (e.g., a column in a data frame).

first_author_last_name

A character vector containing the last name of each record's first author, or FALSE. Supplying first_author_last_name is recommended.

If not FALSE, abstract and first_author_last_name must be aligned element by element, so that the same index refers to the same bibliographic record. This is not an issue when both are columns of the same data frame.

The last name of the first author is used as one of the patterns for removing irrelevant information from the abstract. An example of text removed this way is ". (c) Daisuke Fujiwara et al., 2021;"

If FALSE, normalization based on this pattern is skipped.
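
The sketch below illustrates why the two vectors must be aligned. It builds one regular expression per record from that record's last name and removes a trailing "(c) ... <last name> ... ;" notice. The helper strip_author_notice and its pattern are hypothetical and assumed for illustration; the actual pattern used by norm_abstract is not shown on this page.

```r
# Hypothetical helper for illustration; NOT the implementation of norm_abstract().
strip_author_notice <- function(abstract, first_author_last_name) {
  # FALSE bypasses the author-based pattern entirely.
  if (isFALSE(first_author_last_name)) {
    return(abstract)
  }
  # Element i of each vector must describe the same bibliographic record.
  stopifnot(length(abstract) == length(first_author_last_name))

  mapply(
    function(text, last_name) {
      # \Q...\E makes PCRE treat the name literally (assumed pattern shape).
      pattern <- paste0("\\.?\\s*\\(c\\)[^;]*\\Q", last_name, "\\E[^;]*;?\\s*$")
      sub(pattern, "", text, ignore.case = TRUE, perl = TRUE)
    },
    abstract, first_author_last_name,
    USE.NAMES = FALSE
  )
}

notice_abstracts <- c(
  "Results are promising. (c) Daisuke Fujiwara et al., 2021;",
  "No notice here.",
  "Another finding. (c) Jane Doe et al., 2020;"
)
# The third name is deliberately misaligned with its record.
notice_names <- c("Fujiwara", "Smith", "Fujiwara")

stripped <- strip_author_notice(notice_abstracts, notice_names)
print(stripped)
# [1] "Results are promising"
# [2] "No notice here."
# [3] "Another finding. (c) Jane Doe et al., 2020;"
```

In this sketch the third notice is left in place because the supplied name does not belong to that record, which is the kind of silent miss that misaligned inputs would cause.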

Value

Normalized character vector

Examples

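data(bib_example_small)

# Take the text before the first comma in `author` as the first author's last name
bib_example_small$first_author_last_name <- sapply(stringr::str_split(bib_example_small$author, ",", n = 2), `[`, 1)

# Normalize abstracts, also removing notices that match the first author's last name
bib_example_small$abstract_norm <- norm_abstract(bib_example_small$abstract, bib_example_small$first_author_last_name)

# or skip the author-based pattern
bib_example_small$abstract_norm <- norm_abstract(bib_example_small$abstract, FALSE)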