Split large file into smaller files

Submitted by: @import:stackexchange-codereview

Problem

I recently suggested this method for emulating the Unix utility split in Python.

Is there a more elegant way of doing it?

Assume that the file chunks are too large to be held in memory. Assume that only one line can be held in memory.

import contextlib

def modulo(i,l):
    return i%l

def writeline(fd_out, line):
    fd_out.write('{}\n'.format(line))

file_large = 'large_file.txt'
l = 30*10**6  # lines per split file
with contextlib.ExitStack() as stack:
    fd_in = stack.enter_context(open(file_large))
    for i, line in enumerate(fd_in):
        if not modulo(i,l):
            file_split = '{}.{}'.format(file_large, i//l)
            fd_out = stack.enter_context(open(file_split, 'w'))
        fd_out.write('{}\n'.format(line))


I timed each approach with the Unix utility time and profiled it with the Python module cProfile. Here is what I found (the timings are not directly comparable, as I was running other processes, but they give a good indication of the slow parts of the code):

Ugo's method:

tottime filename:lineno(function)
473.088 {method 'writelines' of '_io._IOBase' objects}

485.36 real       362.04 user        58.91 sys


My code:

tottime function
243.532 modulo
543.031 writeline
419.366 {method 'format' of 'str' objects}
1169.735 {method 'write' of '_io.TextIOWrapper' objects}

3207.60 real      2291.42 user        44.64 sys


The Unix utility split:

1676.82 real       268.92 user      1399.16 sys
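
For reference, per-function figures like the tottime values above can be collected roughly as follows. This is only a minimal sketch: `split_demo` is a hypothetical stand-in workload, not the script that was actually measured, and the tiny input is there just to keep the example self-contained.

```python
import cProfile
import io
import pstats

def split_demo(lines, per_file):
    """Hypothetical stand-in workload: group lines into lists of per_file."""
    groups = []
    for i, line in enumerate(lines):
        if i % per_file == 0:
            groups.append([])
        groups[-1].append(line)
    return groups

# Profile only the call of interest.
profiler = cProfile.Profile()
profiler.enable()
groups = split_demo(['a\n', 'b\n', 'c\n'], 2)
profiler.disable()

# Sort by tottime, the column quoted in the listings above.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats('tottime').print_stats(5)
print(report.getvalue())
```

The real/user/sys figures come from the separate Unix time utility, run on the whole script rather than from inside Python.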

Solution

Unfortunately, as far as I know, there is no lazy chunks function in the standard library.
But a small helper built on itertools makes things rather neat.

from itertools import chain, islice

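def chunks(iterable, n):
    "chunks(ABCDE,2) => AB CD E"
    iterable = iter(iterable)
    # A for loop ends cleanly at exhaustion; a bare next() here would
    # raise RuntimeError on Python 3.7+ (PEP 479).
    for first in iterable:
        yield chain([first], islice(iterable, n - 1))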

l = ...
file_large = 'large_file.txt'
with open(file_large) as bigfile:
    for i, lines in enumerate(chunks(bigfile, l)):
        file_split = '{}.{}'.format(file_large, i)
        with open(file_split, 'w') as f:
            f.writelines(lines)
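
To see how the lazy chunks behave, here is a small sketch run on a short string instead of a file. The helper is restated so the block runs on its own; it fetches the first item of each chunk with a `for` loop, which ends cleanly when the input is exhausted. The name `result` is only for illustration.

```python
from itertools import chain, islice

def chunks(iterable, n):
    """Lazily yield successive n-item chunks: chunks('ABCDE', 2) => AB CD E."""
    iterator = iter(iterable)
    for first in iterator:
        # Each chunk is itself lazy: the first item plus up to n-1 more,
        # all drawn from the same underlying iterator.
        yield chain([first], islice(iterator, n - 1))

# Consume each chunk fully before asking for the next one.
result = [''.join(chunk) for chunk in chunks('ABCDE', 2)]
print(result)  # ['AB', 'CD', 'E']
```

Because all chunks share one underlying iterator, each chunk has to be consumed completely before the next one is requested. The file-splitting loop does this naturally, since writelines drains the chunk before the loop advances, so only one line needs to be in memory at a time.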

Context

StackExchange Code Review Q#57395, answer score: 9
