patternMinor
Replacing accented letters with regular letters in a spreadsheet
Viewed 0 times
accentedwithregularspreadsheetreplacingletters
Problem
I am using the following VBA code to replace accented letters with regular letters in a spreadsheet. This is necessary because these spreadsheets have to be uploaded to an import tool that does not allow foreign characters.
I googled the code from somewhere and it works, but it is a bit slow. In a spreadsheet with 1.5 million cells (7000 rows, 200 columns), it takes 21 seconds to run.
I wanted to look into ways to optimize it, for example:
Function RemoveAccentsFromForeignLetters()
StartNewTask ("Removing accents from foreign letters")
Dim AccChars As String
Dim RegChars As String
AccChars = "ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ"
RegChars = "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy"
Set MyRange = ActiveSheet.UsedRange
Dim A As String * 1
Dim B As String * 1
Dim i As Integer
For i = 1 To Len(AccChars)
A = Mid(AccChars, i, 1)
B = Mid(RegChars, i, 1)
MyRange.Replace What:=A, Replacement:=B, LookAt:=xlPart, MatchCase:=True
' TODO: highlight changed cells yellow
Next
End FunctionI googled the code from somewhere and it works, but it is a bit slow. In a spreadsheet with 1.5 million cells (7000 rows, 200 columns), it takes 21 seconds to run.
I wanted to look into ways to optimize it, for example:
- Maybe a RegEx would be faster?
- Maybe I should pass the entire spreadsheet to a DLL using an array, have the DLL do the replace, then pass it back? One article I read suggested that this is up to 10x faster than using native Excel VBA.
- Maybe I should use the same DLL trick as above, but add multi-threading?
- Any other ideas?
Solution
In addition to @Raystafarian's observations, there are a couple of other issues that I see.
length, and will fail if they aren't. I'd add an assert that tests
that before you do anything that writes to the Worksheet.
absolutely need to (for example in an API call). VBA stores them as a
optimization that is backfiring. When you pass them to
parameters, they are actually being implicitly cast back to
variable length ones.
characters are "regular" in their native context.
are individual characters as opposed to a sub-string. VBA allows a
direct assignment of a
the array is much faster than calling
lot higher when you're doing it inside a loop. As a side note, you
should always use the
to avoid superfluous casting unless you explicitly require a
(returns a
not every cell in the entire Worksheet is going to have an accented
character in it. You should really only be concerned with cells that
do. By performing the replacement on every cell, your performance is scaling directly with the number of cells, not the number of
replacements. So, if only 5% of the cells have accented characters
you are still doing 100% of the work. This is where the regular
expression would be useful, but you can't easily pawn that off on
Excel (except maybe with using
A loop would be better - a loop over an array pulled from the
would be best.
difficult using the
storing the state of the entire sheet, then doing a cell-by-cell
comparison. It will be a lot easier to track this concurrently while
you are making changes.
With all of that in mind, I'd do something more like this:
Some quick and dirty benchmarks, all done with 2000 rows and 10 columns. In the "worse case" benchmarks, all cells have the value "ŸÀÁÂÃÄÅÇÈÉÊËÌÍÎ". In the "average case" benchmarks, 5% of the cells have the "ŸÀÁÂÃÄÅÇÈÉÊËÌÍÎ" and the rest of them contain "XXXXXXXXXXXXXXX".
- I would personally put your
AccCharsandRegCharsvariables into
Consts, because you never change their values.- Your code also requires that
AccCharsandRegCharsare the same
length, and will fail if they aren't. I'd add an assert that tests
that before you do anything that writes to the Worksheet.
- You generally want to avoid using the
Integertype unless you
absolutely need to (for example in an API call). VBA stores them as a
Long regardless of how they are declared.- Declaring
AandBas fixed-length strings is a bit of premature
optimization that is backfiring. When you pass them to
.Replace asparameters, they are actually being implicitly cast back to
variable length ones.
- Minor thing, but I prefer the term "stripped" to "regular" - all
characters are "regular" in their native context.
- Using string functions is not ideal when what you really care about
are individual characters as opposed to a sub-string. VBA allows a
direct assignment of a
String to a Byte array, and indexing intothe array is much faster than calling
Mid. The performance hit is alot higher when you're doing it inside a loop. As a side note, you
should always use the
String returning functions that end with '$'to avoid superfluous casting unless you explicitly require a
Variant type. Mid$ (returns a String) as opposed to Mid(returns a
Variant).- Your call to
.Replacemethod is acting on every single cell in your
Range. This is a huge performance hit, because I'm guessing thatnot every cell in the entire Worksheet is going to have an accented
character in it. You should really only be concerned with cells that
do. By performing the replacement on every cell, your performance is scaling directly with the number of cells, not the number of
replacements. So, if only 5% of the cells have accented characters
you are still doing 100% of the work. This is where the regular
expression would be useful, but you can't easily pawn that off on
Excel (except maybe with using
.Find, which has issues of its own).A loop would be better - a loop over an array pulled from the
Rangewould be best.
- Your "TODO: highlight changed cells yellow" is going to be much more
difficult using the
.Replace function, because it would requirestoring the state of the entire sheet, then doing a cell-by-cell
comparison. It will be a lot easier to track this concurrently while
you are making changes.
With all of that in mind, I'd do something more like this:
Private Sub RemoveAccentsFromForeignLetters()
Dim Target As Range
Set Target = ActiveSheet.UsedRange
StartNewTask ("Removing accents from foreign letters")
Dim Values() As Variant
Values = Target.Value
Debug.Assert Len(AccentedChars) = Len(StrippedChars)
Dim FindChars() As Byte
Dim ReplaceChars() As Byte
FindChars = AccentedChars
ReplaceChars = StrippedChars
Dim AccentedTest As RegExp
Set AccentedTest = New RegExp
AccentedTest.Pattern = "[" & AccentedChars & "]"
Dim index As Long
Dim character As Long
Dim col As Long
Dim row As Long
For row = 1 To UBound(Values, 1)
For col = 1 To UBound(Values, 2)
'Ignore strings that don't require character replacements.
If AccentedTest.Test(Values(row, col)) Then
Dim buffer() As Byte
buffer = StrConv(Values(row, col), vbUnicode)
'Skip every other character - VBA "Unicode" expansion
'inserts nulls there.
For character = 0 To UBound(buffer) Step 2
For index = 0 To UBound(FindChars)
If buffer(character) = FindChars(index) Then
buffer(character) = ReplaceChars(index)
End If
Next index
Next character
'Highlight changed cells yellow left as an exercise for
'the reader.
Values(row, col) = StrConv(buffer, vbFromUnicode)
End If
Next col
Next row
ActiveSheet.UsedRange = Values
End SubSome quick and dirty benchmarks, all done with 2000 rows and 10 columns. In the "worse case" benchmarks, all cells have the value "ŸÀÁÂÃÄÅÇÈÉÊËÌÍÎ". In the "average case" benchmarks, 5% of the cells have the "ŸÀÁÂÃÄÅÇÈÉÊËÌÍÎ" and the rest of them contain "XXXXXXXXXXXXXXX".
Replace method, worst case: 3.51 seconds.
Array method, worst case: 1.15 seconds.
Replace method, average case: .40 seconds (supports disabling ScreenUpdating).
Array method, average case: .08 seconds.Code Snippets
Private Sub RemoveAccentsFromForeignLetters()
Dim Target As Range
Set Target = ActiveSheet.UsedRange
StartNewTask ("Removing accents from foreign letters")
Dim Values() As Variant
Values = Target.Value
Debug.Assert Len(AccentedChars) = Len(StrippedChars)
Dim FindChars() As Byte
Dim ReplaceChars() As Byte
FindChars = AccentedChars
ReplaceChars = StrippedChars
Dim AccentedTest As RegExp
Set AccentedTest = New RegExp
AccentedTest.Pattern = "[" & AccentedChars & "]"
Dim index As Long
Dim character As Long
Dim col As Long
Dim row As Long
For row = 1 To UBound(Values, 1)
For col = 1 To UBound(Values, 2)
'Ignore strings that don't require character replacements.
If AccentedTest.Test(Values(row, col)) Then
Dim buffer() As Byte
buffer = StrConv(Values(row, col), vbUnicode)
'Skip every other character - VBA "Unicode" expansion
'inserts nulls there.
For character = 0 To UBound(buffer) Step 2
For index = 0 To UBound(FindChars)
If buffer(character) = FindChars(index) Then
buffer(character) = ReplaceChars(index)
End If
Next index
Next character
'Highlight changed cells yellow left as an exercise for
'the reader.
Values(row, col) = StrConv(buffer, vbFromUnicode)
End If
Next col
Next row
ActiveSheet.UsedRange = Values
End SubReplace method, worst case: 3.51 seconds.
Array method, worst case: 1.15 seconds.
Replace method, average case: .40 seconds (supports disabling ScreenUpdating).
Array method, average case: .08 seconds.Context
StackExchange Code Review Q#123820, answer score: 4
Revisions (0)
No revisions yet.