HiveBrain v1.2.0
Tags: pattern · python · Minor

Matching maximal nodes with velocities

Submitted by: @import:stackexchange-codereview
Tags: nodes · velocities · maximal matching

Problem

I have written a piece of code that works very well for small data files (see below). The data file 'WSPL.dat' which I want to evaluate has a size of 2.6 GB, and the data file 'VELOC.dat' has a size of 4.3 GB. Producing the intermediate file 'inputv.dat' takes almost 45 minutes, and it has a size of 2.6 GB. In the final step, the code should produce 'output_veloc_max.dat'. After 30 hours of computation, that last output still had not been produced; nothing was written to the final file. I can imagine that opening such a big file and writing a new file that grows continuously could lead to this problem.

My questions:

- Can I optimize this code so that it runs faster?
- If the code cannot be further optimized, should I rewrite it to use multiprocessing?
- If rewriting for multiprocessing helps, how should I do it?

```
from __future__ import print_function
import time
import re

c = []

def get_num(x):
    return int(''.join(ele for ele in x if ele.isdigit()))

with open('WSPL.dat', 'r') as f:
    for line in f:
        if "ND" in line:
            print(line)
            c = get_num(line)
            print(c)
            break

linenum = 0
print("Anzahl der Knoten: ", c)

print('Program started.\n')
maxima = [[float('-inf'), ''] for _ in range(c)]

with open('WSPL.dat') as f:
    for line in f:
        if line.startswith('TS'):
            for maximum, line in zip(maxima, f):
                linenum += 1
                value = float(line)
                if value > maximum[0]:
                    maximum[:] = value, linenum, line

with open('VELOC.dat', 'r') as f, open('inputv.dat', 'w') as outfile:
    for line in f:
        try:
            line = line.strip()
            columns = line.split()
            vx = float(columns[0])
            vy = float(columns[1])
            print("{:.2f}\t{:.2f}".format(vx, vy), file=outfile)
        except ValueError:
            pass

linenum = [x[1] for x in maxima]
i = 1
d = {}
with open('inputv.dat') as f, open('output_veloc_max.dat', 'w') as outfile:
    # (The rest of the original excerpt was cut off; per the answer below,
    # it filtered each line with a test of the form "if num in linenum"
    # and wrote the matching lines to the output file.)
    pass
```

Solution

Performance

Your problem report baffles me. You say that you can successfully produce inputv.dat, a 2.6 GB file, in 45 minutes. Yet filtering inputv.dat to produce output_veloc_max.dat fails after 30 hours? The only plausible bottleneck is the membership test if num in linenum: checking membership in a list takes time proportional to the list's length, which becomes costly when the number of nodes is huge. To make it work efficiently for a large number of nodes, use a set instead of a list:

linenum = set(x[1] for x in maxima)
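As a rough sketch of why the set matters (the sizes here are made up for illustration), a membership test against a list scans the list, while a set lookup is a hash probe:

```python
import timeit

# Hypothetical node indices; the real data would be the line numbers in maxima.
as_list = list(range(200_000))
as_set = set(as_list)

# Searching for a value near the end forces a full scan of the list.
t_list = timeit.timeit(lambda: 199_999 in as_list, number=100)
t_set = timeit.timeit(lambda: 199_999 in as_set, number=100)

# Set membership is O(1) on average; list membership is O(n).
assert t_set < t_list
```

With millions of lines each tested against a large list, that per-test scan alone could plausibly account for the 30-hour stall.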


Multiprocessing would not help you much, since the task is not very parallelizable. It is possible to analyze WSPL.dat in parallel with the VELOC.dat → inputv.dat conversion, but producing output_veloc_max.dat must wait. Since these data files are line-oriented, you cannot easily skip to arbitrary points within them, so you must process VELOC.dat sequentially. The only significant optimization would be to avoid writing out inputv.dat as an intermediate result and reading it back in.

Style

Drop imports that you are not using, like import time and import re. Since you have tagged this question as python-3.x, you also don't need from __future__ import print_function.

The initialization of c as c=[] is garbage. The real assignment is c=get_num(line). The get_num function is weird, in that it gathers up all digits that appear on a line, returning 24 even if the text is "2 cool 4 me".
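To see the quirk concretely, get_num concatenates every digit it finds anywhere on the line:

```python
def get_num(x):
    # Gathers *all* digits on the line, regardless of where they appear.
    return int(''.join(ele for ele in x if ele.isdigit()))

assert get_num("ND 12345") == 12345   # intended use
assert get_num("2 cool 4 me") == 24   # surprising result
```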

I'm not a fan of your maxima two-dimensional array. By storing pairs in a list, you end up with code that is obfuscated by subscripts, like if value > maximum[0] and [x[1] for x in maxima]. Ideally, lists should be used for homogeneous data of variable length. This pair would be better written as a tuple or a namedtuple. But since you don't care about the values as soon as you have obtained the line numbers, you might as well make two separate variables max_samples and max_indices.
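A sketch of the namedtuple alternative (the names here are illustrative, not from the original code):

```python
from collections import namedtuple

# A named pair replaces the opaque [value, line_number] sub-lists.
Maximum = namedtuple('Maximum', ['value', 'line_num'])

maxima = [Maximum(float('-inf'), None)]
sample, line_num = 3.7, 42
if sample > maxima[0].value:          # reads better than maximum[0]
    maxima[0] = Maximum(sample, line_num)

assert maxima[0].line_num == 42
```

Field access like maxima[0].value documents itself, where maximum[0] forces the reader to remember which slot holds what.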

The way you parse VELOC.dat is odd. You indiscriminately attempt to interpret the first column of every line as a float, then ignore the line if that fails. I would guess, looking at your excerpt of VELOC.dat, that a better approach would be to look for a line containing "TS", then take the following c lines, where c is the number of nodes — the same strategy that you use when parsing WSPL.dat.

This program is large enough that it should be split into functions, so that each chunk of code has a name that reflects its purpose, and so that each function has its own local variables.

```
from itertools import count

def n_lines_after_ts(n, f):
    """
    Filter the file, extracting the next n lines after each occurrence of "TS".
    Each result is a triple consisting of:

    * A sequential sample number (counting up from 0)
    * The node number for that sample (cycling 0, 1, ..., n-1)
    * The line from the file
    """
    line_counter = count()
    for line in f:
        if line.startswith('TS'):
            for node in range(n):
                yield next(line_counter), node, next(f)

def num_nodes(wspl_dat):
    """
    Parse the file to extract the first number that appears after "ND".
    """
    for line in wspl_dat:
        if line.startswith('ND'):
            return int(line.partition('ND')[-1])

def wspl_max_indices(num_nodes, wspl_dat):
    """
    Find the indices of the maximum sample for each node.
    """
    max_samples = [float('-inf')] * num_nodes
    max_indices = [None] * num_nodes
    for line_num, node, sample in n_lines_after_ts(num_nodes, wspl_dat):
        sample = float(sample)
        if sample > max_samples[node]:
            max_samples[node] = sample
            max_indices[node] = line_num
    return max_indices

def extract_veloc(num_nodes, veloc_dat, indices):
    """
    Extract the specified indices from the velocity data.
    """
    set_of_indices = set(indices)
    lines = {
        line_num: sample
        for line_num, _, sample in n_lines_after_ts(num_nodes, veloc_dat)
        if line_num in set_of_indices
    }
    return (lines[i] for i in indices)

with open('WSPL.dat') as wspl_dat_file, \
     open('VELOC.dat') as veloc_dat_file, \
     open('output_veloc_max.dat', 'w') as out:
    num_nodes = num_nodes(wspl_dat_file)
    print('Anzahl der Knoten: {0}'.format(num_nodes))
    max_wspl_indices = wspl_max_indices(num_nodes, wspl_dat_file)
    out.write('VECTOR\nND    {0:2d}\nST  0\nTS      0.00\n'.format(num_nodes))
    out.writelines(extract_veloc(num_nodes, veloc_dat_file, max_wspl_indices))
print('Programm ist beendet.')
```


Context

StackExchange Code Review Q#145815, answer score: 5
