
Using regex to replace strange characters

Submitted by: @import:stackexchange-codereview

Problem

After importing some products from a CSV file, strange characters showed up on the product pages. Removing them manually from each product would be too much work, so I wrote this script to run on the product page and strip them out.

$(function() {
    var p_desc = $(".rte").html();
    var re = /\?ÕÌ_|Š|š|Ž|ž|À|Á|Â|Ã|Ä|Å|Æ|Ç|È|É|Ê|Ë|Ì|Í|Î|Ï|Ñ|Ò|Ó|Ô|Õ|Ö|Ø|Ù|Ú|Û|Ü|Ý|Þ|ß|à|á|â|ã|ä|å|æ|ç|è|é|ê|ë|ì|í|î|ï|ð|ñ|ò|ó|ô|õ|ö|ø|ù|ú|û|ý|þ|ÿ|_Œ‚|__|_/g;
    var result = p_desc.replace(re, ' ');
    var new_p_desc = result.replace(/[^\x00-\x7F]/g, "").replace(/\?/g, '');
    $(".rte").html(new_p_desc);
});


My script works, but I'm not sure whether it could be improved. Is this the best way to go about it?

Solution

RegEx Improvements

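The regex can be shortened by using a case-insensitive match with the i flag: characters that appear in both lowercase and uppercase in the regex then only need to be listed once. The flag also makes the pattern match the few letters whose other case was missing from the original list (for example ü), so those are now replaced with a space instead of being removed by the second replace.

After removing the lowercase characters, the regex becomes: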

\?ÕÌ_|Š|Ž|À|Á|Â|Ã|Ä|Å|Æ|Ç|È|É|Ê|Ë|Ì|Í|Î|Ï|Ñ|Ò|Ó|Ô|Õ|Ö|Ø|Ù|Ú|Û|Ü|Ý|Þ|ß|ð|ÿ|_Œ‚|__|_
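As a quick check of that claim, the sketch below uses a shortened, illustrative pattern (not the full list above) to show that the i flag folds case for these accented letters in JavaScript even without the u flag. The variable names and the sample string are made up for the example.

```javascript
// Illustrative subset of the pattern: uppercase forms only, with the i flag.
var upperOnly = /Š|À|É/gi;

// Hypothetical sample text containing both cases of each letter.
var sample = 'šàé-ŠÀÉ';

// Both cases are matched, so only the hyphen survives.
var cleaned = sample.replace(upperOnly, '');
console.log(cleaned); // "-"
```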


Here's a live demo of the regex.

The regex can be improved further by using a character class, which matches faster than a chain of alternations (OR conditions):

\?ÕÌ_|_Œ‚|[ŠŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞßðÿ_]+


Adding the + quantifier also reduces the number of steps taken to match when characters from the character class are adjacent to each other.
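The quantifier affects the output as well as the step count: a run of adjacent matches becomes one replacement instead of several. Here is a minimal sketch of that behaviour, using an illustrative subset of the character class and a made-up input string.

```javascript
// Illustrative subset of the character class, without and with +.
var single = /[ÀÉ_]/g;
var grouped = /[ÀÉ_]+/g;

// Hypothetical input with three adjacent unwanted characters.
var input = 'aÀÉ_b';

var perChar = input.replace(single, ' ');  // one space per matched character
var perRun = input.replace(grouped, ' ');  // one space per run of matches

console.log(JSON.stringify(perChar)); // "a   b"
console.log(JSON.stringify(perRun));  // "a b"
```

If the extra spaces matter for the page layout, the version with + is probably the one you want.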

Here are the demos on RegEx101, without the + quantifier (screenshot) and with the + quantifier (screenshot), applied to the same data. Note that PHP is selected in these demos because the number of steps taken to match is not shown for JavaScript. The regex is also different: it contains the lowercase counterparts of the special characters, because the i flag did not work on them in the PHP flavor, and I did not want to apply the u (Unicode) flag since it was not yet widely supported in JavaScript (it was introduced in ES2015).

These demos only show the difference when + is applied to a character class. The effect should be similar in JavaScript.

Note that __ (two underscores) is redundant: _ is already in the character class, and with the g flag all occurrences are replaced.

Method Chaining

Because replace returns a string, any other string method can be called on its result, so multiple calls to replace can be chained:

str.replace(someRegexOrString, someString)
    .replace(someOtherRegexOrString, someOtherString);


This is equivalent to

var temp = str.replace(someRegexOrString, someString);
var result = temp.replace(someOtherRegexOrString, someOtherString);
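For a concrete instance of the two forms, here is a small sketch with a made-up product string and two simple patterns (neither is taken from the original script).

```javascript
// Hypothetical product text with an underscore and stray question marks.
var text = 'Soft_cotton? shirt?';

// Chained form: each replace works on the result of the previous one.
var chained = text.replace(/_/g, ' ')
    .replace(/\?/g, '');

// Step-by-step form with a temporary variable.
var temp = text.replace(/_/g, ' ');
var stepwise = temp.replace(/\?/g, '');

console.log(chained);              // "Soft cotton shirt"
console.log(chained === stepwise); // true
```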


Replacing HTML

jQuery's html() accepts a function. For each matched element, the function receives the element's index in the set and its current HTML content, and the value it returns becomes the element's new content.

The code can be written as

$('.rte').html(function(index, currentHTML) {
    return doSomeOperationOn(currentHTML);
});


Complete Code

With the above changes, the code becomes:

$(document).ready(function() {
    var regex = /\?ÕÌ_|_Œ‚|[ŠŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞßðÿ_]+/gi;

    $('.rte').html(function(i, oldHTML) {
        return oldHTML.replace(regex, ' ')
            .replace(/[^\x00-\x7F]|\?/g, '');
    });
});


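$(document).ready(function() { is arguably more readable than $(function() {, so you may consider using the more expressive form. Be aware, though, that as of jQuery 3.0 the documentation recommends only $(function() { and marks the other forms as deprecated, although they still work.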
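One possible refinement, sketched here rather than taken from the original answer, is to move the two replacements from the complete code above into a plain function so they can be tried on a string without a page. The names cleanDescription, strangeChars and nonAsciiOrQuestionMark, as well as the sample input, are hypothetical.

```javascript
// Same patterns as in the complete code above.
var strangeChars = /\?ÕÌ_|_Œ‚|[ŠŽÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝÞßðÿ_]+/gi;
var nonAsciiOrQuestionMark = /[^\x00-\x7F]|\?/g;

// Hypothetical helper: takes an HTML string, returns the cleaned string.
function cleanDescription(html) {
    return html.replace(strangeChars, ' ')
        .replace(nonAsciiOrQuestionMark, '');
}

// Made-up sample input for illustration.
var cleanedSample = cleanDescription('Blue_Shirt™ Size: M?');
console.log(cleanedSample); // "Blue Shirt Size: M"

// On the page, assuming jQuery is loaded, the helper can be used with html().
if (typeof $ === 'function') {
    $(function() {
        $('.rte').html(function(i, oldHTML) {
            return cleanDescription(oldHTML);
        });
    });
}
```

Keeping the string handling separate from the DOM call makes it easier to check the patterns against a few known inputs before deploying the script.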

Context

StackExchange Code Review Q#150438, answer score: 11
