
Remove all characters except

Submitted by: @import:stackexchange-codereview

Problem

My code takes a string and replaces with a space every character that is not one of:

  • English letters
  • Numbers
  • , / -

I have tested it and it generally seems to work, but it may contain a serious bug, or there may be a simpler way to write it.

x <- "dog/John is a cutting-edge pilot^¢„þ"
gsub("[^a-zA-Z0-9,-:space:]+", " ", x, perl = TRUE)
"dog/John is a cutting-edge pilot "

Solution

The :space: portion of the regex makes no sense, and probably does not do what you intend.

> x <- "abc:def."
> gsub("[^a-zA-Z0-9,-:space:]", " ", x, perl = TRUE)
[1] "abc:def."


Notice that the colon and period are still present after the substitution.

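In fact, inside the character class, ,-: means "all characters with ASCII codes from 44 (the comma) up to 58 (the colon)", a range that happens to include the hyphen, period, slash, and digits. The remaining space: is read as the individual literal characters s, p, a, c, e, and :.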
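To make the range concrete, here is a small sketch that lists the characters between the comma and the colon and probes a few of them against the class [,-:]. The variable names (lo, hi, covered, probe, hits) are only illustrative.

```r
# Code points of the two range endpoints.
lo <- utf8ToInt(",")  # 44
hi <- utf8ToInt(":")  # 58

# All characters the range ,-: covers, joined into one string.
covered <- intToUtf8(lo:hi)
print(covered)
# [1] ",-./0123456789:"

# Probe a few characters against the class [,-:].
# "+" (43) and ";" (59) sit just outside the range.
probe <- c("+", ",", "-", ".", "/", "5", ":", ";")
hits <- grepl("[,-:]", probe, perl = TRUE)
print(setNames(hits, probe))
```

Only "+" and ";" should fail to match, which is why the period and colon in "abc:def." survive the original substitution.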

To include a literal hyphen, place it first or last in the character class (or, with perl = TRUE, escape it as "\\-"); anywhere else it is treated as a range operator (like A-Z).

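If you want to match whitespace inside a character class, use "\\s" or [:space:]; the latter is only recognized inside an enclosing bracket expression, as in [[:space:]].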
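The two points above (hyphen placement and the whitespace class) can be checked in isolation with a short sketch. The input strings and variable names here are made up for illustration and are not from the original question.

```r
# Hyphen placed first: a literal "-" plus the range a-z are kept,
# everything else is deleted.
keep_hyphen <- gsub("[^-a-z]", "", "cutting-edge!", perl = TRUE)
print(keep_hyphen)
# [1] "cutting-edge"

# [:space:] nested inside the outer brackets matches any whitespace,
# so a run of space, tab, and newline collapses to one space.
squeezed <- gsub("[[:space:]]+", " ", "a \t\n b", perl = TRUE)
print(squeezed)
# [1] "a b"

# With perl = TRUE, \\s inside a class gives the same result here.
squeezed_s <- gsub("[\\s]+", " ", "a \t\n b", perl = TRUE)
print(identical(squeezed, squeezed_s))
# [1] TRUE
```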

So, if you wanted to convert all consecutive strings of junk to a single space, preserving only letters, digits, commas, slashes, hyphens, and whitespace, you could write:

gsub("[^-,/a-zA-Z0-9[:space:]]+", " ", x, perl = TRUE)


or

gsub("[^-,/a-zA-Z0-9\\s]+", " ", x, perl = TRUE)
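As a quick check, the corrected expression can be wrapped in a helper and run on the earlier test string. This is a sketch only: the function name clean and the second, ASCII-only input are hypothetical and not part of the original answer.

```r
# Keep letters, digits, commas, slashes, hyphens, and whitespace;
# collapse each run of anything else into a single space.
clean <- function(s) {
  gsub("[^-,/a-zA-Z0-9[:space:]]+", " ", s, perl = TRUE)
}

clean("abc:def.")
# [1] "abc def "

clean("dog/John is a cutting-edge pilot^%$#!")
# [1] "dog/John is a cutting-edge pilot "

# trimws() from base R drops the trailing space left by junk at the end.
trimws(clean("abc:def."))
# [1] "abc def"
```

Unlike the original pattern, the colon and period are now replaced, while the slash and hyphen are preserved because they are listed explicitly rather than picked up by an accidental range.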

Code Snippets

> x <- "abc:def."
> gsub("[^a-zA-Z0-9,-:space:]", " ", x, perl = TRUE)
[1] "abc:def."
gsub("[^-,/a-zA-Z0-9[:space:]]+", " ", x, perl = TRUE)
gsub("[^-,/a-zA-Z0-9\\s]+", " ", x, perl = TRUE)

Context

StackExchange Code Review Q#157350, answer score: 5
