Splitting large text file and sorting by content
Problem
I have a large text file (~2GB) full of data. The data (sample below) gives an x, y, z coordinate, and a corresponding result on each line (there is other stuff but I don't care about it). The single large text file is too large to be useful, so I want to split it into several smaller files. However, I want each file to contain all the points on one y-plane. The first few lines of the file are below:
mcnp version 6 ld=05/08/13 probid = 09/09/15 23:06:39
Detector Test
Number of histories used for normalizing tallies = 2237295223.00
Mesh Tally Number 14
photon mesh tally.
Tally bin boundaries:
X direction: -600.00 -598.00 -596.00 ... 1236.00 1238.00 1240.00 1242.00 1244.00 1258.00 1260.00
Y direction: 0.00 10.00 20.00 ... 740.00 750.00 760.00 770.00 780.00 790.00 800.00 810.00 820.00 830.00 840.00 850.00 860.00
Z direction: -60.00 -58.00 -56.00 ... 592.00 594.00 596.00 598.00 600.00
Energy bin boundaries: 1.00E-03 1.00E+36
Energy X Y Z Result Rel Error Volume Rslt * Vol
1.000E+36 -599.000 5.000 -59.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00
1.000E+36 -599.000 5.000 -57.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00
1.000E+36 -599.000 5.000 -55.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00
... and repeat forever...
I've truncated some of it for readability, but you get the idea. The data I want is the four last lines.
The code currently does the following:
1. Find the line data headers (Energy X Y ...)
2. Find the y value of the first line of data
3. Add the data to a list until we find data with a different y value
4. Dump the list to a file named with the y value, delete the list
5. Repeat steps 3 and 4 until the end of the file.
Not all the data for each y plane is toget
Solution
You have a lot of nesting going on here. That's generally harder to read and parse, especially when you could make liberal use of continue instead. continue skips to the next iteration of the loop, ignoring all the remaining code in the loop body. So you could move your check for the header line to the top and avoid a level of indentation:

    for l in f:
        #If data header not found
        if not i:
            if l.lstrip().startswith("Energy X Y Z Result Rel Error Volume Rslt * Vol"):
                i += 1
                print("found start")
            continue

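To see the effect of continue in isolation, here is a minimal sketch that runs on an in-memory list of lines rather than the real file. The function name rows_after_header and the shortened sample lines are made up for illustration; they are not part of the original code.

```python
def rows_after_header(lines, header):
    """Return the split fields of every line that follows the header line."""
    header_found = False
    rows = []
    for line in lines:
        if not header_found:
            # Still in the preamble: look for the header, then skip this
            # line either way (the header line itself is not data).
            if line.lstrip().startswith(header):
                header_found = True
            continue
        # Only reached once the header has been seen, with no extra nesting.
        rows.append(line.split())
    return rows


# Hypothetical, shortened sample input
sample = [
    "Mesh Tally Number 14",
    "Energy X Y Z Result",
    "1.000E+36 -599.000 5.000 -59.000 0.00000E+00",
    "1.000E+36 -599.000 5.000 -57.000 0.00000E+00",
]
rows = rows_after_header(sample, "Energy X Y Z Result")
print(len(rows))  # prints 2
```

The data-handling code sits at the top level of the loop body because everything that should not reach it has already hit continue.
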
i is a terrible variable name here. It is initially used to indicate that the header line has been found, then seems to become an index value. Instead I would initialise i as your index once this line is found, and use a named boolean like header_found for the flag. A clear name could remove the need for comments, since if header_found is self-explanatory. Likewise, I think you should use line instead of l. You do use line to replace l later. l in particular can look like a one or an upper case letter i, so it's not clear.

Also, there's nothing wrong with doing line = line.split(), since you don't need the original value of line after this part.

I'd move i += 1 out of the if/else, since it happens in both cases anyway. You can do it at the start of the loop if you just initialise i as 0. Once again, I'd use continue to save a level of nesting, like so:

    i += 1
    #If this is the first line of data
    if i == 1:
        curY = line[2]
        coords.append((line[1], line[2], line[3], line[4]))
        continue
    #If data has the same y value as previous
    if curY == line[2]:
        coords.append((line[1], line[2], line[3], line[4]))
        continue
    #New y value, dump existing data to file
    fname = "Y={}.txt".format(curY)

Also, append mode will still create a new empty file if none exists, so you don't need to open the file differently depending on whether it already exists. Just always open with 'a' and then write your data. You can check whether the file exists beforehand and store the result as a boolean, so that the column header is only written to a new file:

    #If the y value has already been encountered, append to the existing file
    new_file = not os.path.exists(fname)
    with open(fname, 'a') as out:
        if new_file:
            out.write("X Y Z Result \n")
        for coord in coords:
            out.write("{:10}{:10}{:10}{:10}\n".format(*coord))

So here's how I'd put together the whole thing:

    import os

    header = "Energy X Y Z Result Rel Error Volume Rslt * Vol"

    def dump(curY, coords):
        #Append the collected points to the file for this y value
        fname = "Y={}.txt".format(curY)
        #Check before opening: append mode creates the file if it is missing
        new_file = not os.path.exists(fname)
        with open(fname, 'a') as out:
            if new_file:
                out.write("X Y Z Result \n")
            for coord in coords:
                out.write("{:10}{:10}{:10}{:10}\n".format(*coord))

    with open("meshtal", 'r') as f:
        header_found = False
        i = 0
        coords = []
        curY = 0
        for line in f:
            if not header_found:
                if line.lstrip().startswith(header):
                    print("found start")
                    header_found = True
                continue
            line = line.split()
            i += 1
            #If this is the first line of data
            if i == 1:
                curY = line[2]
                coords.append((line[1], line[2], line[3], line[4]))
                continue
            #If data has the same y value as previous
            if curY == line[2]:
                coords.append((line[1], line[2], line[3], line[4]))
                continue
            #New y value, dump existing data to file
            dump(curY, coords)
            i = 1
            coords = []
            curY = line[2]
            coords.append((line[1], line[2], line[3], line[4]))
        #The loop only dumps when the y value changes, so write the last plane here
        if coords:
            dump(curY, coords)

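As a possible variation (a different technique from the loop above, not something the original code does), itertools.groupby can collect each consecutive run of rows that share a y value, which removes the need for the i counter and the curY bookkeeping. This is only a sketch: the function name split_by_y, the out_dir parameter, and the sample rows are made up for illustration, and it assumes the rows have already been split into fields after the header line.

```python
import os
import tempfile
from itertools import groupby


def split_by_y(rows, out_dir):
    """Append each consecutive run of rows with the same y to out_dir/Y=<y>.txt.

    rows is an iterable of already-split data lines, where index 2 holds y
    and indices 1 to 4 hold x, y, z and the result.
    """
    for y, group in groupby(rows, key=lambda row: row[2]):
        fname = os.path.join(out_dir, "Y={}.txt".format(y))
        # Check before opening: append mode creates the file if it is missing.
        new_file = not os.path.exists(fname)
        with open(fname, "a") as out:
            if new_file:
                out.write("X Y Z Result \n")
            for row in group:
                out.write("{:10}{:10}{:10}{:10}\n".format(*row[1:5]))


# Hypothetical sample rows: y = 5.000 shows up again after y = 15.000
rows = [
    "1.000E+36 -599.000 5.000 -59.000 0.00000E+00".split(),
    "1.000E+36 -599.000 15.000 -59.000 0.00000E+00".split(),
    "1.000E+36 -597.000 5.000 -59.000 0.00000E+00".split(),
]
out_dir = tempfile.mkdtemp()
split_by_y(rows, out_dir)
print(sorted(os.listdir(out_dir)))
```

Because groupby yields the final run like any other, the last plane is written without a special case after the loop. A y value that reappears later in the input is appended to its existing file, and the column header is written only once.
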
Context
StackExchange Code Review Q#104340, answer score: 2