Clean and normalize abstract in bibliography
norm_abstract.Rd
The abstract is normalized with the following string operations:
Remove trailing information such as ". (C) 1998 International Astronautical Federation Published by Elsevier Science Ltd. All rights reserved." and "(C) 2000 Elsevier Science B.V. All rights reserved.", according to 6 empirically observed patterns, to reduce the influence of such boilerplate.
Convert letters to lowercase.
Remove whitespace from the start and end of the string, and reduce repeated whitespace inside the string.
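The steps above can be sketched in base R as follows. This is only an illustration, not the implementation of norm_abstract: the function name sketch_norm_abstract is hypothetical, and the single regular expression is an assumption modeled on the two examples quoted above, whereas the function itself uses 6 patterns that are not listed here.

```r
# Hypothetical sketch of the three steps; NOT the actual norm_abstract code.
sketch_norm_abstract <- function(abstract) {
  # Step 1: drop a trailing "(C) <year> ... All rights reserved." notice.
  # This single pattern is an assumption; the real function uses 6 patterns.
  out <- sub("\\.?\\s*\\(C\\) \\d{4} .*All rights reserved\\.?\\s*$", "",
             abstract, perl = TRUE)
  # Step 2: convert letters to lowercase.
  out <- tolower(out)
  # Step 3: trim both ends, then collapse runs of whitespace to one space.
  out <- trimws(out)
  gsub("\\s+", " ", out)
}

x <- "  We study   orbital debris. (C) 2000 Elsevier Science B.V. All rights reserved."
sketch_norm_abstract(x)
# expected: "we study orbital debris"
```

Note that the period before "(C)" is removed together with the notice, in line with the first example quoted above.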
Arguments
- abstract
A character vector (e.g., a column in a data frame)
- first_author_last_name
A character vector containing the last name of each record's first author, or FALSE. Supplying first_author_last_name is recommended. If it is not FALSE, abstract and first_author_last_name must be aligned, so that the same index refers to the same bibliographic record in both. This should not be an issue when both are columns of the same data frame.
The last name of the first author is used as one of the patterns for removing irrelevant information from the abstract. An example of the information removed is ". (c) Daisuke Fujiwara et al., 2021;".
If FALSE, abstract normalization according to this pattern is bypassed.
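The role of first_author_last_name can be illustrated with a small base R sketch. The helper names strip_one and strip_author_notice are hypothetical, and the matching rule (cut at the first "(c)" when the text after it contains the last name) is an assumption based only on the single example above. The actual pattern used by norm_abstract may differ.

```r
# Hypothetical helper for ONE record; not part of the package.
strip_one <- function(a, name) {
  if (is.na(a) || is.na(name) || !nzchar(name)) return(a)
  # Start of an optional ". " followed by "(c)", case-insensitive; -1 if absent.
  pos <- as.integer(regexpr("\\.?\\s*\\(c\\)", a, ignore.case = TRUE))
  if (pos < 0) return(a)
  tail_text <- substring(a, pos)
  # Cut only when the notice mentions the first author's last name.
  if (grepl(tolower(name), tolower(tail_text), fixed = TRUE)) {
    substr(a, 1, pos - 1)
  } else {
    a
  }
}

# Vectorized wrapper: FALSE bypasses the pattern, mirroring the argument above.
strip_author_notice <- function(abstract, first_author_last_name) {
  if (isFALSE(first_author_last_name)) return(abstract)
  # Same index must refer to the same bibliographic record.
  stopifnot(length(abstract) == length(first_author_last_name))
  mapply(strip_one, abstract, first_author_last_name, USE.NAMES = FALSE)
}

abstracts <- c("Results are robust. (c) Daisuke Fujiwara et al., 2021;",
               "No notice here.")
authors <- c("Fujiwara", "Smith")
strip_author_notice(abstracts, authors)  # first element loses its notice
strip_author_notice(abstracts, FALSE)    # returned unchanged
```

The length check reflects the alignment requirement: if the two vectors came from different row orders, a notice could be matched against the wrong author and left in place.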
Examples
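# Load the example data set
data(bib_example_small)
# Take the text before the first comma of the author field as the first author's last name
bib_example_small$first_author_last_name <- sapply(stringr::str_split(bib_example_small$author, ",", n = 2), `[`, 1)
# Normalize abstracts, using the last name as an additional removal pattern
bib_example_small$abstract_norm <- norm_abstract(bib_example_small$abstract, bib_example_small$first_author_last_name)
# or bypass the author-name pattern
bib_example_small$abstract_norm <- norm_abstract(bib_example_small$abstract, FALSE)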