More efficient way of removing line indices?
Problem
After profiling my current script I've found that nearly 100% of the run time (and I'm not surprised) comes from the following function:
def rem_asc_by_pos(xmlFile, fileOut, removeIndices):
    fileIn = xmlFile.replace('.xml', '.asc')
    with open(fileIn, 'r') as fin:
        with open(fileOut, 'w') as fout:
            #remove indices from each line and write to new file
            if len(removeIndices) > 0:
                for line in fin:
                    lineList = list(line)
                    checkLine = ''
                    newLine = []
                    for ind, char in enumerate(lineList):
                        if ind not in removeIndices:
                            newLine.append(char)
                    lineOut = ''.join(newLine)
                    fout.write(lineOut)

The files themselves can have anywhere from 15 thousand lines to 2.5 million at the most, and each line is usually 300-600 characters.
I was thinking that the write operation may be causing an increase in run-time, so I could write multiple lines at a time, but I wouldn't know a good way to determine when to write the lines (so as to not run out of memory).
Is there anything obvious that I might be able to change to decrease the overall run-time? To go through all the files currently it takes about 20 hours and any decrease from that is appreciated!
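The batching idea the asker describes can be sketched like this (my own sketch, not part of the post; the chunk size of 10,000 is an arbitrary assumption, and note that Python's file objects already buffer writes internally, so explicit chunking rarely changes much):

```python
def write_in_chunks(file_in, file_out, chunk_size=10_000):
    """Copy file_in to file_out, flushing every chunk_size lines.

    chunk_size is an arbitrary choice; memory use is bounded by
    roughly chunk_size * line_length characters.
    """
    buffer = []
    with open(file_in) as fin, open(file_out, 'w') as fout:
        for line in fin:
            buffer.append(line)          # the processed line would go here
            if len(buffer) >= chunk_size:
                fout.writelines(buffer)  # one writelines call per chunk
                buffer.clear()
        fout.writelines(buffer)          # flush whatever remains
```
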
edit: Adding an example of input and expected output!
Input:
I have a bunch of data files that look similar to this:
511111 12 1.000000 1.000000 1.000000 .750000
511112 12 1.000000 .666667 .500000 .555556

The removeIndices list contains the indices to remove from each line; for this example we want to remove the first three and indices 19-28.
Expected output:
111 12 1.000000 1.000000 .750000
112 12 1.000000 .500000 .555556

Solution
- Test case
You can't fix what you don't measure. So let's make a test case containing a million lines similar to the example input in your question:
>>> with open('in.txt', 'w') as f:
... for _ in range(10**6):
... _ = f.write('511111 12 1.000000 1.000000 1.000000 .750000\n')

How long does your code take?
>>> from timeit import timeit
>>> indices = list(range(3)) + list(range(18, 29))
>>> timeit(lambda:rem_asc_by_pos('in.txt', 'out.txt', indices), number=1)
32.99110442500023

- Rewrite
Here's a quick rewrite to simplify your code:
- Follow the Python style guide (PEP8).
- Write a docstring explaining what the code does.
- Pick a better name for the function.
- Combine the two with clauses into one to keep the indentation sensible.
- No need to pass mode 'r' to open: that's the default.
And some changes to speed it up:
- Turn the list of removed indices into a set, so that we can look up indices in constant time.
- There's no need to split line into a list: you can iterate over a string just as easily.
- Drop the checkLine, newLine, and lineOut intermediate variables.
- Cache fout.write in a local variable to avoid having to look it up for every line.
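The set conversion in the first point matters because the in operator scans a list in O(n) but hashes into a set in O(1) on average, and that membership test runs once per character of every line. A quick illustration (my own, not from the answer; timings are machine-dependent):

```python
from timeit import timeit

indices_list = list(range(3)) + list(range(18, 29))
indices_set = set(indices_list)

# Both structures give the same membership answers...
assert (20 in indices_list) == (20 in indices_set) == True
assert (5 in indices_list) == (5 in indices_set) == False

# ...but at different cost: the list is scanned element by element,
# the set does a single hash lookup.
t_list = timeit(lambda: 299 in indices_list, number=100_000)
t_set = timeit(lambda: 299 in indices_set, number=100_000)
print(f'list: {t_list:.4f}s  set: {t_set:.4f}s')
```
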
def remove_indices(file_in, file_out, indices):
    """Copy file_in to file_out, removing the given indices."""
    index_set = set(indices)
    with open(file_in) as fin, open(file_out, 'w') as fout:
        write = fout.write
        for line in fin:
            write(''.join(c for i, c in enumerate(line) if i not in index_set))
This is about twice as fast (in my test case):
>>> timeit(lambda:remove_indices('in.txt', 'out.txt', indices), number=1)
17.147783935997722

- Use slices instead of indices
The next improvement is to represent the parts of the line to be kept (rather than the parts to be removed) in the form of Python slice objects. A slice object represents a range of indices in the same form as an array slice, that is, as start and stop indices together with a step.

So if you are going to remove indices 0-2 and 18-28, then you are going to keep indices 3-17 and 29 onwards. In slice notation that would be:
>>> slices = [slice(3, 18), slice(29, None)]

Note that the stop for the first slice is 18, because ranges of integers in Python are always exclusive at the top.
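To make the slice-object behaviour concrete, here's a toy example of my own (not from the answer) on a short string:

```python
s = 'abcdefghij'
parts = [slice(2, 5), slice(7, None)]  # keep indices 2-4 and 7 to the end

# Indexing with a slice object is identical to writing s[start:stop]
result = ''.join(s[p] for p in parts)
print(result)  # cdehij
```
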
Here's a version of your function that takes slices to keep instead of indices to remove:
def keep_slices(file_in, file_out, slices):
    """Copy file_in to file_out, keeping the given slices."""
    with open(file_in) as fin, open(file_out, 'w') as fout:
        write = fout.write
        for line in fin:
            for s in slices:
                write(line[s])

This is substantially faster:
>>> timeit(lambda:keep_slices('in.txt', 'out.txt', slices), number=1)
1.877711878001719

That's about an 18 times speedup (on this test case; it might be different on other test cases). Is that good enough for you? It ought to reduce your 20-hour runtime to an hour or so.
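As a follow-up sketch of my own (not part of the answer): the keep slices can be derived automatically from a set of removed indices by grouping the consecutive runs of kept positions, assuming a known fixed line length. The helper name is hypothetical:

```python
def keep_slices_from_removed(remove_indices, line_length):
    """Return slice objects covering the runs of indices NOT in
    remove_indices, for lines of exactly line_length characters.
    (Hypothetical helper, not from the original answer.)
    """
    removed = set(remove_indices)
    slices = []
    start = None
    for i in range(line_length):
        if i in removed:
            if start is not None:   # a kept run just ended
                slices.append(slice(start, i))
                start = None
        elif start is None:         # a kept run just began
            start = i
    if start is not None:
        slices.append(slice(start, None))  # final run extends to end of line
    return slices
```

For the example indices this reproduces the hand-written list: keep_slices_from_removed(list(range(3)) + list(range(18, 29)), 45) gives [slice(3, 18), slice(29, None)].
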
Context
StackExchange Code Review Q#40773, answer score: 7