Split large file into smaller files
Problem
I recently suggested this method for emulating the Unix utility split in Python.
Is there a more elegant way of doing it?
Assume that the file chunks are too large to be held in memory. Assume that only one line can be held in memory.

import contextlib

def modulo(i,l):
    return i%l

def writeline(fd_out, line):
    fd_out.write('{}\n'.format(line))

file_large = 'large_file.txt'
l = 30*10**6 # lines per split file
with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large))
    for i, line in enumerate(fd_in):
        if not modulo(i,l):
            file_split = '{}.{}'.format(file_large, i//l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        fd_out.write('{}\n'.format(line))

I ran the Unix utility time and the Python module cProfile. Here is what I found (the methods are not directly comparable, as I was running other processes, but this gives a good indication of the slow parts of the code):

Ugo's method:

tottime filename:lineno(function)
473.088 {method 'writelines' of '_io._IOBase' objects}

485.36 real 362.04 user 58.91 sys

My code:

tottime function
243.532 modulo
543.031 writeline
419.366 {method 'format' of 'str' objects}
1169.735 {method 'write' of '_io.TextIOWrapper' objects}

3207.60 real 2291.42 user 44.64 sys

The Unix utility split:

1676.82 real 268.92 user 1399.16 sys

Solution
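Unfortunately, as far as I know, there is no lazy chunks function in the standard library, but the following recipe makes things rather neat. (Python 3.12 added itertools.batched, but it builds each batch as a tuple in memory, which the question rules out.)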
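
from itertools import chain, islice

def chunks(iterable, n):
    "chunks(ABCDE,2) => AB CD E"
    iterable = iter(iterable)
    while True:
        try:
            first = next(iterable)
        except StopIteration:
            # On Python 3.7+ (PEP 479) a StopIteration escaping a generator
            # becomes a RuntimeError, so end the generator explicitly.
            return
        # Each chunk shares the underlying iterator, so consume it fully
        # before asking for the next one.
        yield chain([first], islice(iterable, n-1))

l = ...  # lines per split file
file_large = 'large_file.txt'
with open(file_large) as bigfile:
    for i, lines in enumerate(chunks(bigfile, l)):
        file_split = '{}.{}'.format(file_large, i)
        with open(file_split, 'w') as f:
            f.writelines(lines)
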
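
To try the chunking approach on something small before pointing it at a huge file, here is a minimal, self-contained sketch. The names split_file and lines_per_file and the throwaway sample.txt are my own, purely illustrative choices, not part of the original answer. It catches StopIteration explicitly, because on Python 3.7 and later (PEP 479) letting it escape a generator raises RuntimeError.

```python
import os
import tempfile
from itertools import chain, islice


def chunks(iterable, n):
    """Yield lazy iterators of up to n items: chunks('ABCDE', 2) => AB CD E."""
    iterator = iter(iterable)
    while True:
        try:
            first = next(iterator)
        except StopIteration:
            return  # input exhausted; end the generator cleanly
        # Each chunk reads from the shared iterator, so it must be fully
        # consumed before the next chunk is requested.
        yield chain([first], islice(iterator, n - 1))


def split_file(path, lines_per_file):
    """Hypothetical helper: write path.0, path.1, ... and return their names."""
    written = []
    with open(path) as bigfile:
        for i, lines in enumerate(chunks(bigfile, lines_per_file)):
            part = '{}.{}'.format(path, i)
            with open(part, 'w') as f:
                f.writelines(lines)  # streams lines one at a time from bigfile
            written.append(part)
    return written


# Small check on a throwaway five-line file, split two lines at a time.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.txt')
    with open(sample, 'w') as f:
        f.write('a\nb\nc\nd\ne\n')
    parts = split_file(sample, 2)
    contents = []
    for p in parts:
        with open(p) as f:
            contents.append(f.read())
    print(contents)  # ['a\nb\n', 'c\nd\n', 'e\n']
```

Only one output file is open at a time here, and no chunk is ever materialised as a list, so memory use should stay at roughly one line plus the I/O buffers.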
Context
StackExchange Code Review Q#57395, answer score: 9