
Combining regex


Problem

How can this be minimized?

// remove accent
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(input);
input = System.Text.Encoding.UTF8.GetString(bytes);

// make it all lower case
input = input.ToLower();

// remove stop words
input = System.Text.RegularExpressions.Regex.Replace(input, "\\b" + string.Join("\\b|\\b", ENGLISH_STOP_WORDS) + "\\b", "");

// remove entities
input = System.Text.RegularExpressions.Regex.Replace(input, @"&\w+;", "");

// remove anything that is not letters, numbers, dash, or space
input = System.Text.RegularExpressions.Regex.Replace(input, @"[^a-z0-9\-\s]", "");

// replace spaces
input = input.Replace(' ', '-');

// collapse dashes
input = System.Text.RegularExpressions.Regex.Replace(input, @"-{2,}", "-");

// collapse spaces
input = System.Text.RegularExpressions.Regex.Replace(input, @"\s+", " ").Trim();

// Trim dashes and spaces
input = input.Trim(' ').Trim('-').Trim(' '); // double trim the spaces in case dashes were covering them

return input;

Solution

// remove accent

Actually, no. The code under that comment is just a lossless round trip to and from UTF-8, which doesn’t change the text.
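
To make that concrete, here is a minimal sketch of the same round trip. The class and method names are made up for illustration; the two Encoding calls are the ones from the question. For any well-formed string the method should hand back exactly what it was given, accents included.

```csharp
using System.Text;

static class Utf8RoundTripDemo
{
    // Hypothetical helper: encodes to UTF-8 bytes and decodes them again,
    // exactly like the "remove accent" step in the question.
    public static string RoundTrip(string input)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(input);
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Calling `Utf8RoundTripDemo.RoundTrip("café")` returns `"café"`, so nothing has been stripped.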

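For the rest, I’d coalesce the regular expressions: if nothing else, one pass over the string is more efficient than three. I’d also import the System.Text.RegularExpressions namespace to get rid of the overlong explicit qualification. The “collapse spaces” phase makes no sense, since every ordinary space has already been replaced by a dash at that point (only tabs and newlines, which the character class lets through, could still be affected).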

Finally, you can also coalesce the Trim statements.

Ignoring for now that the accent removal doesn’t work, this leaves us with:

input = input.ToLower();

// remove stop words, entities and anything that is not letters, numbers, dash, or space
string stopWords = string.Format("\\b{0}\\b", string.Join("\\b|\\b", ENGLISH_STOP_WORDS));
input = Regex.Replace(input, stopWords + @"|&\w+;|[^a-z0-9\-\s]", "");

// replace spaces
input = input.Replace(' ', '-');

// collapse dashes
input = Regex.Replace(input, @"-{2,}", "-");

// Trim dashes and spaces
input = input.Trim(' ', '-');
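
The combined pattern above quietly assumes that no stop word contains a regex metacharacter. As a hedged sketch of a more defensive variant, the words can be passed through Regex.Escape and wrapped in a single non-capturing group, so the word boundaries are written only once. The stop-word list and the class and member names here are hypothetical, since ENGLISH_STOP_WORDS is not shown in the question.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class StopWordPattern
{
    // Hypothetical stand-in for ENGLISH_STOP_WORDS.
    static readonly string[] EnglishStopWords = { "a", "an", "and", "the", "of" };

    // Stop words, HTML-style entities, and anything that is not
    // a lowercase letter, digit, dash, or whitespace, in one pattern.
    public static readonly Regex Cleanup = new Regex(
        @"\b(?:" + string.Join("|", EnglishStopWords.Select(w => Regex.Escape(w))) + @")\b"
        + @"|&\w+;|[^a-z0-9\-\s]");

    // Expects input that has already been lowercased.
    public static string Strip(string lowered)
    {
        return Cleanup.Replace(lowered, "");
    }
}
```

Building the Regex once and reusing it also avoids re-joining the stop words on every call.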


To actually remove accents, normalize the string to Unicode form D, which decomposes each accented character into a base character followed by combining diacritical marks, and then drop the combining marks:

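using System.Globalization; // CharUnicodeInfo, UnicodeCategory
using System.Text;          // NormalizationForm, StringBuilder

static string RemoveDiacritics(string stIn) {
    // Form D splits each accented character into a base character plus combining marks.
    string stFormD = stIn.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();

    for(int ich = 0; ich < stFormD.Length; ich++) {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        // Keep everything except the combining (non-spacing) marks.
        if(uc != UnicodeCategory.NonSpacingMark) {
            sb.Append(stFormD[ich]);
        }
    }

    return sb.ToString();
}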
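
Putting the pieces together, the whole routine might look roughly like the sketch below. Several things in it are assumptions rather than part of the original code: the stop-word list is a hypothetical stand-in for ENGLISH_STOP_WORDS, the class and method names are invented, ToLowerInvariant is swapped in for ToLower so the result does not depend on the current culture, and the separate "replace spaces" and "collapse dashes" steps are folded into a single `[\s-]+` replacement. Note that the diacritics must be removed before the character-class filter runs, otherwise accented letters would simply be deleted.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

static class SlugSketch
{
    // Hypothetical stand-in for ENGLISH_STOP_WORDS.
    static readonly string[] EnglishStopWords = { "a", "an", "and", "the", "of" };

    static string RemoveDiacritics(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            // Skip combining marks, keep everything else.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString();
    }

    public static string Slugify(string input)
    {
        input = RemoveDiacritics(input).ToLowerInvariant();

        // Remove stop words, entities, and any other disallowed characters.
        string stopWords = @"\b(?:" + string.Join("|", EnglishStopWords) + @")\b";
        input = Regex.Replace(input, stopWords + @"|&\w+;|[^a-z0-9\-\s]", "");

        // Turn every run of whitespace and/or dashes into a single dash.
        input = Regex.Replace(input, @"[\s-]+", "-");

        return input.Trim('-');
    }
}
```

With this list of stop words, `SlugSketch.Slugify("The Café of Crème Brûlée")` yields `"cafe-creme-brulee"`.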


Context

StackExchange Code Review Q#15274, answer score: 4
