Efficiently concatenate substrings of long list of strings
Problem
I am having performance problems with the following python function:
def remove_comments(buffer):
    new_buffer = ''
    lines = buffer.split('\n')
    for line in lines:
        line_wo_comments = line.split('--')[0] + '\n'
        new_buffer = new_buffer + line_wo_comments
    return new_buffer

When buffer is very large (thousands+ lines), the function gets slower and slower as it processes the buffer.
What techniques could I use to speed this function call up?
Assume that the input is a source code file. Lines of length 1 - ~120 characters. Lines may or may not have comments. The files could be many lines long. The especially problematic ones are machine generated (1-10k+ lines long).
Update: The intention is to use this as a "pre-processing" step for the buffer contents (a file). I guess I am not too interested in possible ways to completely refactor this (i.e. methods to avoid needing to iterate through all the lines multiple times), but rather in making the essence of buffer in / buffer out as fast as possible.
Solution
The Performance Tips page on the Python wiki has comments about repeated string concatenation, which you can find here:
https://wiki.python.org/moin/PythonSpeed/PerformanceTips#String_Concatenation
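The loop in the question is an instance of that pattern: each new_buffer + line_wo_comments may copy everything accumulated so far, so the total work can grow roughly quadratically with the number of lines. As a minimal sketch, here is the same function with the pieces collected in a list and joined once at the end. The name remove_comments_join is made up for this illustration and is not from the original post; the split on '--' is kept from the question.

```python
def remove_comments_join(buffer):
    """Same behaviour as the question's remove_comments, but joins once."""
    pieces = []
    for line in buffer.split('\n'):
        # maxsplit=1 is an optional tweak: only the text before the first
        # '--' is needed, so there is no point splitting any further.
        pieces.append(line.split('--', 1)[0] + '\n')
    # A single join instead of one string concatenation per line.
    return ''.join(pieces)


print(repr(remove_comments_join("a -- x\nb")))
```

Like the original, this appends a newline after every line, including the last one.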
Specifically, that section suggests using "".join(listOfStrings) instead of repeatedly appending to an accumulator with +=.

So I would try something like this, using re.finditer() to find all of the comments, and place the non-comment parts into a list:

import re
def removeComments(s):
    chunks = []
    offset = 0
    # Each match runs from "--" through the newline that ends the line
    # (a comment on a final line with no trailing newline is not matched).
    for m in re.finditer("--.*\n", s):
        chunks.append(s[offset:m.start(0)])
        offset = m.end(0) - 1  # resume at the newline so it is kept
    chunks.append(s[offset:])
    return "".join(chunks)

s = """
line 1
line 2 -- comment 2
line 3
line 4 -- comment 4
line 5
line 6 -- comment 6
line 7
"""
print(removeComments(s))

An advantage of this approach over splitting each line is that if there are large chunks of your program which do not have any comments, they will be transferred to the chunks list in one piece instead of as separate lines.

Update

I would also try using a regexp replace approach - it could be even faster:

def removeComments(s):
    return re.sub('(?m)--.*$', '', s)

Code Snippets
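Whether the re.sub version really is faster than the re.finditer version will depend on the input and the interpreter, so it is worth measuring on your own files. Below is a rough timing sketch using timeit. The function names, the synthetic 10,000-line buffer and the repeat count are assumptions chosen for illustration, not part of the original answer.

```python
import re
import timeit


def remove_comments_finditer(s):
    # Collect the non-comment stretches and join them once.
    chunks = []
    offset = 0
    for m in re.finditer("--.*\n", s):
        chunks.append(s[offset:m.start(0)])
        offset = m.end(0) - 1  # keep the newline that ended the comment
    chunks.append(s[offset:])
    return "".join(chunks)


def remove_comments_sub(s):
    # (?m) makes $ match at the end of every line, not just the end of s.
    return re.sub(r"(?m)--.*$", "", s)


# Synthetic input (an assumption): every other line carries a comment.
lines = []
for i in range(10000):
    if i % 2 == 0:
        lines.append("line %d -- comment %d\n" % (i, i))
    else:
        lines.append("line %d\n" % i)
sample = "".join(lines)

# Both versions should agree on input where every line ends with a newline.
assert remove_comments_finditer(sample) == remove_comments_sub(sample)

for fn in (remove_comments_finditer, remove_comments_sub):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print("%-26s %.4f s for 20 runs" % (fn.__name__, seconds))
```

The two versions differ on a comment in a final line that has no trailing newline: the re.sub pattern removes it, while the "--.*\n" pattern leaves it in place.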
import re

def removeComments(s):
    chunks = []
    offset = 0
    for m in re.finditer("--.*\n", s):
        chunks.append(s[offset:m.start(0)])
        offset = m.end(0) - 1  # resume at the newline so it is kept
    chunks.append(s[offset:])
    return "".join(chunks)

s = """
line 1
line 2 -- comment 2
line 3
line 4 -- comment 4
line 5
line 6 -- comment 6
line 7
"""
print(removeComments(s))

def removeComments(s):
    return re.sub('(?m)--.*$', '', s)

Context
StackExchange Code Review Q#102689, answer score: 3