Clean and normalize author in bibliography
norm_author.Rd
The author field is normalized with the following string operations:
replace “and” with “&” (so the separator word contributes less to dissimilarity between author strings)
remove all whitespace (this drops stray extra spaces and keeps spacing differences from inflating dissimilarity)
convert letters to lowercase
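The three steps above can be sketched in base R as follows. This is only an illustration: `norm_author_sketch` is a hypothetical helper, not the package's implementation of `norm_author()`, and matching the separator as " and " with surrounding spaces is an assumption made here so that names containing "and" (e.g., "Anderson") are left alone.

```r
# Hypothetical sketch of the normalization steps; not the package's own code.
norm_author_sketch <- function(author) {
  # Step 1: replace the " and " separator with " & " (assumed matching rule).
  out <- gsub(" and ", " & ", author, fixed = TRUE)
  # Step 2: remove all whitespace.
  out <- gsub("\\s+", "", out)
  # Step 3: convert letters to lowercase.
  tolower(out)
}

norm_author_sketch(c("Xia, Z. X. and Dai, W. W.", "Whitman, C. P."))
#> [1] "xia,z.x.&dai,w.w." "whitman,c.p."
```

Doing the replacement before stripping whitespace matters in this sketch: once the spaces are gone, "and" can no longer be told apart from the same letters inside a name.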
Arguments
- author
A character vector (e.g., a column in a data frame)
- rm_punctuation
Logical: should extra punctuation introduced by unaccenting characters be removed? Defaults to FALSE. Using bash iconv to unaccent characters introduces extra punctuation (e.g., '`^~\"). If you unaccented with bash iconv, this punctuation needs to be removed as well. Since we use python now, this is not needed by default.
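As a rough illustration of what removing this punctuation could look like, the sketch below strips the marks listed above with a single character class. `strip_translit_marks` is a hypothetical helper, and the exact set of characters that `norm_author()` removes when `rm_punctuation = TRUE` is an assumption here; only the quote, backtick, caret, tilde, and double-quote marks from the description are handled.

```r
# Hypothetical helper; the character set is assumed from the description above.
strip_translit_marks <- function(x) {
  # Character class with ' ` ^ ~ and " ; the caret is not first, so it is literal.
  gsub("['`^~\"]", "", x)
}

strip_translit_marks("Ren'ee")
#> [1] "Renee"
```

A filter like this cannot tell a transliteration mark from a real apostrophe, so a name such as "O'Brien" would also lose its apostrophe. That is one reason to leave the option off when the input was not produced by iconv.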
Examples
# Example 1
author <- c("Xia, Z. X. and Dai, W. W. and Xiong, J. P. and Hao, Z. P. and Davidson, V. L. and White, S. and Mathews, F. S.",
"Ahmed, M. H. and Koparde, V. N. and Safo, M. K. and Neel Scarsdale, J. and Kellogg, G. E.",
"Whitman, C. P." )
norm_author(author)
#> [1] "xia,z.x.&dai,w.w.&xiong,j.p.&hao,z.p.&davidson,v.l.&white,s.&mathews,f.s."
#> [2] "ahmed,m.h.&koparde,v.n.&safo,m.k.&neelscarsdale,j.&kellogg,g.e."
#> [3] "whitman,c.p."
# Example 2
# é becomes 'e if you unaccent characters with
# `cat file.bib | iconv -f utf8 -t ascii//TRANSLIT//IGNORE > convert.bib`.
# Set `rm_punctuation = TRUE` to remove the extra punctuation introduced.
author2 <- c("Ren'ee")
norm_author(author2, rm_punctuation = TRUE)
#> [1] "renee"