Matching maximal nodes with velocities
Problem
I have written a piece of code that works very well for small data files (see below). The data file 'WSPL.dat' that I want to evaluate is 2.6 GB in size, and the data file 'VELOC.dat' is 4.3 GB. Producing the intermediate file 'inputv.dat' takes almost 45 minutes and yields a 2.6 GB file. In the final step, the code should produce 'output_veloc_max.dat'. After 30 hours of computation, this last output still had not been produced; nothing was written to the final file. I can imagine that opening such a large file and writing a new file that grows continuously could lead to this problem.
My questions:

- Can I optimize this code so that it runs faster?
- If the code cannot be optimized further, should I rewrite it as multiprocessing code?
- If rewriting for multiprocessing helps, how should I do it?
```
from __future__ import print_function
import time
import re

c = []

def get_num(x):
    return int(''.join(ele for ele in x if ele.isdigit()))

with open('WSPL.dat', 'r') as f:
    for line in f:
        if "ND" in line:
            print(line)
            c = get_num(line)
            print(c)
            break

linenum = 0
print("Anzahl der Knoten: ", c)
print('Program started.\n')
maxima = [[float('-inf'), ''] for _ in range(c)]

with open('WSPL.dat') as f:
    for line in f:
        if line.startswith('TS'):
            for maximum, line in zip(maxima, f):
                linenum += 1
                value = float(line)
                if value > maximum[0]:
                    maximum[:] = value, linenum, line

with open('VELOC.dat', 'r') as f, open('inputv.dat', 'w') as outfile:
    for line in f:
        try:
            line = line.strip()
            columns = line.split()
            vx = float(columns[0])
            vy = float(columns[1])
            print("{:.2f}\t{:.2f}".format(vx, vy), file=outfile)
        except ValueError:
            pass

linenum = [x[1] for x in maxima]
i = 1
d = {}
with open('inputv.dat
```
Solution
Performance
Your problem report baffles me. You say that you can successfully produce `inputv.dat`, a 2.6 GB file, in 45 minutes. Yet filtering `inputv.dat` to produce `output_veloc_max.dat` fails after 30 hours? The only possible bottleneck is `if num in linenum`, and that would only be inefficient if the number of nodes is huge. To make it work efficiently for a large number of nodes, you could use a set instead of a list:

```
linenum = set(x[1] for x in maxima)
```
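As a rough illustration of the difference (a minimal sketch with a made-up index count, not your actual node numbers), a membership test scans a list element by element but does a single hash lookup in a set:

```
import timeit

# Hypothetical size, for illustration only.
indices_list = list(range(1000000))
indices_set = set(indices_list)

# Worst case for the list: the sought value is at the end.
print(timeit.timeit('999999 in indices_list', globals=globals(), number=100))
print(timeit.timeit('999999 in indices_set', globals=globals(), number=100))
```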
Multiprocessing would not help you much, since the task is not very parallelizable. It is possible to analyze `WSPL.dat` in parallel with the `VELOC.dat` → `inputv.dat` conversion, but producing `output_veloc_max.dat` must wait for both. Since these data files are line-oriented, you cannot easily skip to arbitrary points within them, so you must process `VELOC.dat` sequentially. The only significant optimization would be to avoid writing out `inputv.dat` as an intermediate result and reading it back in.
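As a minimal sketch of that idea (assuming `VELOC.dat` keeps the whitespace-separated two-column layout that your current parsing implies; the `velocities` name is mine), the conversion step can be turned into a generator and consumed directly, with no intermediate file:

```
def velocities(veloc_path):
    """Yield (vx, vy) pairs straight from the file, skipping unparsable lines."""
    with open(veloc_path) as f:
        for line in f:
            columns = line.split()
            try:
                yield float(columns[0]), float(columns[1])
            except (IndexError, ValueError):
                continue
```

The later filtering step can then iterate over `velocities('VELOC.dat')` instead of re-reading `inputv.dat`.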
Style

Drop imports that you are not using, like `import time` and `import re`. Since you have tagged this question as python-3.x, you also don't need `from __future__ import print_function`.
The initialization of `c` as `c = []` is garbage. The real assignment is `c = get_num(line)`. The `get_num` function is weird, in that it gathers up all digits that appear on a line, returning 24 even if the text is "2 cool 4 me".
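To see that concretely, running your `get_num` on such a line concatenates the digits:

```
def get_num(x):
    return int(''.join(ele for ele in x if ele.isdigit()))

print(get_num("2 cool 4 me"))  # 24, even though "24" never appears in the text
print(get_num("ND 1500"))      # 1500, the intended case
```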
I'm not a fan of your `maxima` two-dimensional array. By storing pairs in a list, you end up with code that is obfuscated by subscripts, like `if value > maximum[0]` and `[x[1] for x in maxima]`. Ideally, lists should be used for homogeneous data of variable length. This pair would be better written as a tuple or a namedtuple. But since you don't care about the values as soon as you have obtained the line numbers, you might as well make two separate variables `max_samples` and `max_indices`.
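For comparison, a namedtuple version (the `Maximum` name is hypothetical) gives the fields names instead of magic subscripts:

```
from collections import namedtuple

Maximum = namedtuple('Maximum', ['value', 'line_num'])

m = Maximum(value=3.7, line_num=42)
print(m.value, m.line_num)  # reads far better than m[0] and m[1]
```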
The way you parse `VELOC.dat` is odd. You indiscriminately attempt to interpret the first column of every line as a float, then ignore the line if that fails. I would guess, looking at your excerpt of `VELOC.dat`, that a better approach would be to look for a line containing "TS", then take the following `c` lines, where `c` is the number of nodes. That is the same strategy that you use when parsing `WSPL.dat`.

This program is large enough that it should be split into functions, so that each chunk of code has a name that reflects its purpose, and so that each function has its own local variables.
```
from itertools import count

def n_lines_after_ts(n, f):
    """
    Filter the file, extracting the next n lines after each occurrence of "TS".

    Each result is a triple consisting of:
    * A sequential sample number (counting up from 0)
    * The node number for that sample (cycling 0, 1, ..., n-1)
    * The line from the file
    """
    line_counter = count()
    for line in f:
        if line.startswith('TS'):
            for node in range(n):
                yield next(line_counter), node, next(f)

def num_nodes(wspl_dat):
    """
    Parse the file to extract the first number that appears after "ND".
    """
    for line in wspl_dat:
        if line.startswith('ND'):
            return int(line.partition('ND')[-1])

def wspl_max_indices(num_nodes, wspl_dat):
    """
    Find the indices of the maximum sample for each node.
    """
    max_samples = [float('-inf')] * num_nodes
    max_indices = [None] * num_nodes
    for line_num, node, sample in n_lines_after_ts(num_nodes, wspl_dat):
        sample = float(sample)
        if sample > max_samples[node]:
            max_samples[node] = sample
            max_indices[node] = line_num
    return max_indices

def extract_veloc(num_nodes, veloc_dat, indices):
    """
    Extract the specified indices from the velocity data.
    """
    set_of_indices = set(indices)
    lines = {
        line_num: sample
        for line_num, _, sample in n_lines_after_ts(num_nodes, veloc_dat)
        if line_num in set_of_indices
    }
    return (lines[i] for i in indices)

with open('WSPL.dat') as wspl_dat_file, \
     open('VELOC.dat') as veloc_dat_file, \
     open('output_veloc_max.dat', 'w') as out:
    num_nodes = num_nodes(wspl_dat_file)
    print('Anzahl der Knoten: {0}'.format(num_nodes))
    max_wspl_indices = wspl_max_indices(num_nodes, wspl_dat_file)
    out.write('VECTOR\nND {0:2d}\nST 0\nTS 0.00\n'.format(num_nodes))
    out.writelines(extract_veloc(num_nodes, veloc_dat_file, max_wspl_indices))
    print('Programm ist beendet.')
```
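To make the generator's contract concrete, here is a tiny usage example (it assumes the `n_lines_after_ts` definition above; the in-memory data is made up, with two nodes per "TS" block):

```
import io

sample = io.StringIO("TS 0.0\n1.5\n2.5\nTS 1.0\n3.5\n0.5\n")
for line_num, node, line in n_lines_after_ts(2, sample):
    print(line_num, node, line.strip())
# Output:
# 0 0 1.5
# 1 1 2.5
# 2 0 3.5
# 3 1 0.5
```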
Context
StackExchange Code Review Q#145815, answer score: 5