HiveBrain v1.2.0

Python parallelization using Popen

Submitted by: @import:stackexchange-codereview

Problem

I frequently run a script similar to the one below to analyze an arbitrary number of files in parallel on a computer with 8 cores.

I use Popen to control each subprocess, but I sometimes run into problems when a process produces a lot of stdout or stderr, because the pipe buffer fills up and the process blocks. I work around this by frequently reading from the streams. I also print the streams from one of the processes to help me follow the progress of the analysis.
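For a single process, the standard-library way to avoid this class of deadlock is `Popen.communicate()`, which drains both pipes to EOF concurrently and then waits for the process to exit. A minimal sketch, using `sys.executable` with a tiny inline script as a stand-in for the real analysis program:

```python
import subprocess
import sys

# communicate() reads stdout and stderr without deadlocking, even when
# the child writes more than one pipe buffer's worth of output.
p = subprocess.Popen(
    [sys.executable, '-c', 'import sys; print("out"); sys.stderr.write("err\\n")'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
out, err = p.communicate()
```

When many processes must be polled at once, as in the script below, `communicate()` alone is not enough, since it blocks until one child finishes; that is what motivates the manual read loop here.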

I'm curious about alternative ways to parallelize this in Python, and I'd welcome general comments about the implementation, which, as always, has room for improvement. Thanks!

```
import sys
import time
import subprocess

def parallelize(analysis_program_path, filenames, N_CORES):
    '''
    Function that parallelizes an analysis on a list of files on
    N_CORES number of cores
    '''
    running = []
    sys.stderr.write('Starting analyses\n')
    while filenames or running:
        while filenames and len(running) < N_CORES:
            # Submit a new analysis
            filename = filenames.pop(0)
            cmd = '%s %s' % (analysis_program_path, filename)
            p = subprocess.Popen(cmd, shell=True,
                                 stdout=subprocess.PIPE,
                                 stderr=subprocess.PIPE)
            sys.stderr.write('Analyzing %s\n' % filename)
            running.append((cmd, p))
        i = 0
        while i < len(running):
            (cmd, p) = running[i]
            returncode = p.poll()
            st_out = p.stdout.read()
            st_err = p.stderr.read()  # Read the buffer! Otherwise it fills up and blocks the script
            if i == 0:  # Just print one of the processes
                # The tail of the snippet was truncated in the import; the
                # lines below reconstruct the obvious remainder: echo the
                # first process's output, reap finished processes, and
                # sleep before polling again.
                sys.stdout.write(st_out)
                sys.stderr.write(st_err)
            if returncode is not None:
                running.pop(i)  # Process finished
            else:
                i += 1
        time.sleep(1)
```

Solution

Python has what you want built into the standard library: see the multiprocessing module, and in particular the map method of the Pool class.

So you can implement what you want in one line, perhaps like this:

from multiprocessing import Pool

def parallelize(analysis, filenames, processes):
    '''
    Call `analysis` for each file in the sequence `filenames`, using
    up to `processes` parallel processes. Wait for them all to complete
    and then return a list of results.
    '''
    return Pool(processes).map(analysis, filenames, chunksize=1)


Context

StackExchange Code Review Q#20416, answer score: 5
