HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Replacing accented letters with regular letters in a spreadsheet

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
accentedwithregularspreadsheetreplacingletters

Problem

I am using the following VBA code to replace accented letters with regular letters in a spreadsheet. This is necessary because these spreadsheets have to be uploaded to an import tool that does not allow foreign characters.

Function RemoveAccentsFromForeignLetters()

    StartNewTask ("Removing accents from foreign letters")

    Dim AccChars As String
    Dim RegChars As String

    AccChars = "ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ"
    RegChars = "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy"
    Set MyRange = ActiveSheet.UsedRange

    Dim A As String * 1
    Dim B As String * 1
    Dim i As Integer

    For i = 1 To Len(AccChars)
        A = Mid(AccChars, i, 1)
        B = Mid(RegChars, i, 1)
        MyRange.Replace What:=A, Replacement:=B, LookAt:=xlPart, MatchCase:=True
        ' TODO: highlight changed cells yellow
    Next

End Function


I googled the code from somewhere and it works, but it is a bit slow. In a spreadsheet with 1.5 million cells (7000 rows, 200 columns), it takes 21 seconds to run.

I wanted to look into ways to optimize it, for example:

  • Maybe a RegEx would be faster?



  • Maybe I should pass the entire spreadsheet to a DLL using an array, have the DLL do the replace, then pass it back? One article I read suggested that this is up to 10x faster than using native Excel VBA.



  • Maybe I should use the same DLL trick as above, but add multi-threading?



  • Any other ideas?

Solution

In addition to @Raystafarian's observations, there are a couple of other issues that I see.

  • I would personally put your AccChars and RegChars variables into


Consts, because you never change their values.

  • Your code also requires that AccChars and RegChars are the same


length, and will fail if they aren't. I'd add an assert that tests
that before you do anything that writes to the Worksheet.

  • You generally want to avoid using the Integer type unless you


absolutely need to (for example in an API call). VBA stores them as a
Long regardless of how they are declared.

  • Declaring A and B as fixed-length strings is a bit of premature


optimization that is backfiring. When you pass them to .Replace as
parameters, they are actually being implicitly cast back to
variable length ones.

  • Minor thing, but I prefer the term "stripped" to "regular" - all


characters are "regular" in their native context.

  • Using string functions is not ideal when what you really care about


are individual characters as opposed to a sub-string. VBA allows a
direct assignment of a String to a Byte array, and indexing into
the array is much faster than calling Mid. The performance hit is a
lot higher when you're doing it inside a loop. As a side note, you
should always use the String returning functions that end with '$'
to avoid superfluous casting unless you explicitly require a
Variant type. Mid$ (returns a String) as opposed to Mid
(returns a Variant).

  • Your call to .Replace method is acting on every single cell in your


Range. This is a huge performance hit, because I'm guessing that
not every cell in the entire Worksheet is going to have an accented
character in it. You should really only be concerned with cells that
do. By performing the replacement on every cell, your performance is scaling directly with the number of cells, not the number of
replacements. So, if only 5% of the cells have accented characters
you are still doing 100% of the work. This is where the regular
expression would be useful, but you can't easily pawn that off on
Excel (except maybe with using .Find, which has issues of its own).
A loop would be better - a loop over an array pulled from the Range
would be best.

  • Your "TODO: highlight changed cells yellow" is going to be much more


difficult using the .Replace function, because it would require
storing the state of the entire sheet, then doing a cell-by-cell
comparison. It will be a lot easier to track this concurrently while
you are making changes.

With all of that in mind, I'd do something more like this:

Private Sub RemoveAccentsFromForeignLetters()
    Dim Target As Range
    Set Target = ActiveSheet.UsedRange

    StartNewTask ("Removing accents from foreign letters")

    Dim Values() As Variant
    Values = Target.Value

    Debug.Assert Len(AccentedChars) = Len(StrippedChars)

    Dim FindChars() As Byte
    Dim ReplaceChars() As Byte
    FindChars = AccentedChars
    ReplaceChars = StrippedChars

    Dim AccentedTest As RegExp
    Set AccentedTest = New RegExp
    AccentedTest.Pattern = "[" & AccentedChars & "]"

    Dim index As Long
    Dim character As Long

    Dim col As Long
    Dim row As Long
    For row = 1 To UBound(Values, 1)
        For col = 1 To UBound(Values, 2)
            'Ignore strings that don't require character replacements.
            If AccentedTest.Test(Values(row, col)) Then
                Dim buffer() As Byte
                buffer = StrConv(Values(row, col), vbUnicode)
                'Skip every other character - VBA "Unicode" expansion 
                'inserts nulls there.
                For character = 0 To UBound(buffer) Step 2
                    For index = 0 To UBound(FindChars)
                        If buffer(character) = FindChars(index) Then
                            buffer(character) = ReplaceChars(index)
                        End If
                    Next index
                Next character
                'Highlight changed cells yellow left as an exercise for
                'the reader.
                Values(row, col) = StrConv(buffer, vbFromUnicode)
            End If
        Next col
    Next row

    ActiveSheet.UsedRange = Values
End Sub


Some quick and dirty benchmarks, all done with 2000 rows and 10 columns. In the "worse case" benchmarks, all cells have the value "ŸÀÁÂÃÄÅÇÈÉÊËÌÍÎ". In the "average case" benchmarks, 5% of the cells have the "ŸÀÁÂÃÄÅÇÈÉÊËÌÍÎ" and the rest of them contain "XXXXXXXXXXXXXXX".

Replace method, worst case: 3.51 seconds. 
Array method, worst case: 1.15 seconds. 
Replace method, average case: .40 seconds (supports disabling ScreenUpdating). 
Array method, average case: .08 seconds.

Code Snippets

Private Sub RemoveAccentsFromForeignLetters()
    Dim Target As Range
    Set Target = ActiveSheet.UsedRange

    StartNewTask ("Removing accents from foreign letters")

    Dim Values() As Variant
    Values = Target.Value

    Debug.Assert Len(AccentedChars) = Len(StrippedChars)

    Dim FindChars() As Byte
    Dim ReplaceChars() As Byte
    FindChars = AccentedChars
    ReplaceChars = StrippedChars

    Dim AccentedTest As RegExp
    Set AccentedTest = New RegExp
    AccentedTest.Pattern = "[" & AccentedChars & "]"

    Dim index As Long
    Dim character As Long

    Dim col As Long
    Dim row As Long
    For row = 1 To UBound(Values, 1)
        For col = 1 To UBound(Values, 2)
            'Ignore strings that don't require character replacements.
            If AccentedTest.Test(Values(row, col)) Then
                Dim buffer() As Byte
                buffer = StrConv(Values(row, col), vbUnicode)
                'Skip every other character - VBA "Unicode" expansion 
                'inserts nulls there.
                For character = 0 To UBound(buffer) Step 2
                    For index = 0 To UBound(FindChars)
                        If buffer(character) = FindChars(index) Then
                            buffer(character) = ReplaceChars(index)
                        End If
                    Next index
                Next character
                'Highlight changed cells yellow left as an exercise for
                'the reader.
                Values(row, col) = StrConv(buffer, vbFromUnicode)
            End If
        Next col
    Next row

    ActiveSheet.UsedRange = Values
End Sub
Replace method, worst case: 3.51 seconds. 
Array method, worst case: 1.15 seconds. 
Replace method, average case: .40 seconds (supports disabling ScreenUpdating). 
Array method, average case: .08 seconds.

Context

StackExchange Code Review Q#123820, answer score: 4

Revisions (0)

No revisions yet.