
Author strings are normalized as follows:

  • replace “and” with “&” (reduces its effect on string dissimilarity)

  • remove all spaces (removes extra whitespace and reduces space-caused dissimilarity)

  • convert letters to lowercase
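The three steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's actual implementation: the helper name `norm_author_sketch` is hypothetical, and the set of punctuation stripped when `rm_punctuation = TRUE` is assumed to be the characters mentioned in the argument description.

```r
# Hypothetical helper for illustration only; norm_author() itself may differ.
norm_author_sketch <- function(author, rm_punctuation = FALSE) {
  # Step 1: replace the standalone word "and" with "&"
  out <- gsub("\\band\\b", "&", author, perl = TRUE)
  # Step 2: remove all whitespace
  out <- gsub("\\s+", "", out, perl = TRUE)
  # Step 3: convert letters to lowercase
  out <- tolower(out)
  # Optional: drop punctuation left behind by unaccenting (assumed set: ' ` ^ ~ ")
  if (rm_punctuation) {
    out <- gsub("['`^~\"]", "", out, perl = TRUE)
  }
  out
}

norm_author_sketch("Whitman, C. P.")
norm_author_sketch("Ren'ee", rm_punctuation = TRUE)
```

Matching "and" on word boundaries keeps names that merely contain those letters (for example "Sandy") intact.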

Usage

norm_author(author, rm_punctuation = FALSE)

Arguments

author

A character vector (e.g., a column in a data frame)

rm_punctuation

Logical. Set to TRUE if unaccenting the characters introduced extra punctuation that needs to be removed. Defaults to FALSE.

Unaccenting characters with bash iconv introduces extra punctuation (e.g., '`^~\"). If you use bash iconv, this punctuation must be removed as well. Because we now use Python, this is not needed by default.

Value

Normalized character vector

Examples

# Example 1
author <- c("Xia, Z. X. and Dai, W. W. and Xiong, J. P. and Hao, Z. P. and Davidson, V. L. and White, S. and Mathews, F. S.",
"Ahmed, M. H. and Koparde, V. N. and Safo, M. K. and Neel Scarsdale, J. and Kellogg, G. E.",
"Whitman, C. P." )

norm_author(author)
#> [1] "xia,z.x.&dai,w.w.&xiong,j.p.&hao,z.p.&davidson,v.l.&white,s.&mathews,f.s."
#> [2] "ahmed,m.h.&koparde,v.n.&safo,m.k.&neelscarsdale,j.&kellogg,g.e."          
#> [3] "whitman,c.p."                                                             


# Example 2
# é becomes 'e if you unaccent characters with
# `cat file.bib | iconv -f utf8 -t ascii//TRANSLIT//IGNORE > convert.bib`.
# Set `rm_punctuation = TRUE` to remove the extra punctuation introduced.
author2 <- c("Ren'ee")

norm_author(author2, rm_punctuation = TRUE)
#> [1] "renee"