HiveBrain v1.2.0

Efficiently concatenate substrings of a long list of strings

Submitted by: @import:stackexchange-codereview

Problem

I am having performance problems with the following Python function:

def remove_comments(buffer):
    new_buffer = ''
    lines = buffer.split('\n')
    for line in lines:
        line_wo_comments = line.split('--')[0] + '\n'
        new_buffer = new_buffer + line_wo_comments
    return new_buffer


When buffer is very large (thousands of lines or more), the function gets slower and slower as it processes the buffer.

What techniques could I use to speed this function call up?

Assume that the input is a source code file with lines of 1 to about 120 characters, which may or may not contain comments. The files can be many lines long; the especially problematic ones are machine generated (1-10k+ lines long).

Update: The intention is to use this as a "pre-processing" step for the buffer contents (a file). I am not too interested in possible ways to completely refactor this (i.e. methods to avoid iterating through all the lines multiple times), but rather in making the essence of buffer in / buffer out as fast as possible.

Solution

The Performance Tips page on the Python wiki discusses repeated string concatenation:

https://wiki.python.org/moin/PythonSpeed/PerformanceTips#String_Concatenation

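Specifically, it suggests using "".join(listOfStrings) instead of repeatedly appending to an accumulator with +=. Because Python strings are immutable, each concatenation may copy everything accumulated so far, so the total work can grow quadratically with the number of lines.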
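Applied to the function from the question, that advice might look like the following minimal sketch. The name remove_comments_join is hypothetical, and unlike the original function this version does not append an extra newline after the last line.

```python
def remove_comments_join(buffer):
    """Strip '--' comments line by line, building the result with one join."""
    # Keep only the text before the first '--' on each line.
    stripped = [line.split('--', 1)[0] for line in buffer.split('\n')]
    # Join once at the end instead of growing a string inside the loop.
    return '\n'.join(stripped)


sample = "a = 1 -- set a\nb = 2\n"
print(repr(remove_comments_join(sample)))  # 'a = 1 \nb = 2\n'
```

The per-line split is kept here only to stay close to the original code; the list plus a single join is the part that matters.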

So I would try something like this, using re.finditer() to find all of the comments and collecting the non-comment parts in a list:

import re

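def removeComments(s):
  # Collect the non-comment pieces and join them once at the end.
  chunks = []
  offset = 0
  # Each match covers a comment from "--" up to and including the newline.
  # A comment on a final line with no trailing newline is not matched.
  for m in re.finditer("--.*\n", s):
    chunks.append( s[offset: m.start(0)] )
    # Step back one character so the newline itself is kept.
    offset = m.end(0)-1
  # Append whatever follows the last comment.
  chunks.append( s[offset:] )
  return "".join(chunks)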

s = """
line 1
line 2  -- comment 2
line 3
line 4 -- comment 4
line 5
line 6 -- comment 6
line 7
"""
print(removeComments(s))


An advantage of this approach over splitting each line is that large comment-free stretches of your program are transferred to the chunks list in one piece instead of as separate lines.
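To make that concrete, here is a small sketch that counts the pieces each approach produces. The helper split_chunks and the sample text are assumptions for illustration; the helper repeats the loop from removeComments but returns the list instead of joining it.

```python
import re


def split_chunks(s):
    """Return the non-comment chunks that the finditer approach would join."""
    chunks = []
    offset = 0
    for m in re.finditer("--.*\n", s):
        chunks.append(s[offset:m.start(0)])
        offset = m.end(0) - 1  # keep the newline that ends the comment
    chunks.append(s[offset:])
    return chunks


# 1000 comment-free lines followed by a single commented line.
text = "x = 1\n" * 1000 + "y = 2 -- note\n"
chunks = split_chunks(text)
print(len(text.split('\n')))  # 1002 pieces when splitting per line
print(len(chunks))            # 2 pieces with finditer
```

With few comments, the list handed to join stays short no matter how many lines the file has.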

Update

I would also try a regular-expression replacement with re.sub(); it could be even faster:

def removeComments(s):
  return re.sub('(?m)--.*$', '', s)
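Whether the regex replacement really is faster depends on the input, so it is worth measuring. The sketch below times the three approaches with timeit on a synthetic buffer. The function names, the make_buffer helper, the buffer contents and the repeat count are all assumptions chosen for illustration.

```python
import re
import timeit


def concat_version(buffer):
    # The approach from the question: grow the result by concatenation.
    new_buffer = ''
    for line in buffer.split('\n'):
        new_buffer = new_buffer + line.split('--')[0] + '\n'
    return new_buffer


def finditer_version(s):
    # Collect non-comment chunks and join them once.
    chunks = []
    offset = 0
    for m in re.finditer("--.*\n", s):
        chunks.append(s[offset:m.start(0)])
        offset = m.end(0) - 1
    chunks.append(s[offset:])
    return "".join(chunks)


def sub_version(s):
    # Let the regex engine do the removal in one call.
    return re.sub('(?m)--.*$', '', s)


def make_buffer(n_lines):
    # Synthetic input (an assumption): every third line carries a comment.
    lines = []
    for i in range(n_lines):
        if i % 3 == 0:
            lines.append("x%d = %d -- comment %d" % (i, i, i))
        else:
            lines.append("x%d = %d" % (i, i))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    buf = make_buffer(5000)
    for fn in (concat_version, finditer_version, sub_version):
        seconds = timeit.timeit(lambda: fn(buf), number=5)
        print("%-18s %.4f s" % (fn.__name__, seconds))
```

No timings are shown here because they vary with the machine, the Python version and the proportion of commented lines. Note that concat_version emits one more trailing newline than the other two.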

Context

StackExchange Code Review Q#102689, answer score: 3