Splitting large text file and sorting by content

Submitted by: @import:stackexchange-codereview

Problem

I have a large text file (~2GB) full of data. The data (sample below) gives an x, y, z coordinate, and a corresponding result on each line (there is other stuff but I don't care about it). The single large text file is too large to be useful, so I want to split it into several smaller files. However, I want each file to contain all the points on one y-plane. The first few lines of the file are below:

mcnp   version 6     ld=05/08/13  probid =  09/09/15 23:06:39    
 Detector Test    
 Number of histories used for normalizing tallies =    2237295223.00    

 Mesh Tally Number        14    
 photon   mesh tally.   

 Tally bin boundaries:

    X direction:   -600.00   -598.00   -596.00   ... 1236.00   1238.00   1240.00   1242.00   1244.00    1258.00   1260.00

    Y direction:      0.00     10.00     20.00     ...    740.00    750.00    760.00    770.00    780.00    790.00    800.00    810.00    820.00    830.00   840.00    850.00    860.00

    Z direction:    -60.00    -58.00    -56.00    ...  592.00    594.00    596.00    598.00    600.00    
    Energy bin boundaries: 1.00E-03 1.00E+36    

   Energy         X         Y         Z     Result     Rel Error     Volume    Rslt * Vol    
  1.000E+36  -599.000     5.000   -59.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00    
  1.000E+36  -599.000     5.000   -57.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00    
  1.000E+36  -599.000     5.000   -55.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00
... and repeat forever...


I've truncated some of it for readability, but you get the idea. The data I want is in the last four lines.

The code currently does the following:

  1. Find the line with the data headers (Energy X Y ...)
  2. Find the y value of the first line of data
  3. Add the data to a list until we find data with a different y value
  4. Dump the list to a file named with the y value, delete the list
  5. Repeat steps 3 and 4 until the end of the file.

Not all the data for each y plane is toget

Solution

You have a lot of nesting going on here. That's generally harder to read and parse, especially when you could make liberal use of continue instead. continue skips to the next iteration of the loop, ignoring all remaining code in the loop body. So you could move your check for the header line to the top and avoid indentation:

for l in f:
    # If data header not found
    if not i:
        if l.lstrip().startswith("Energy         X         Y         Z     Result     Rel Error     Volume    Rslt * Vol"):
            i += 1
            print("found start")
        continue


Also, i is a terrible variable name here. i is initially used to indicate that the header line has been found, then seems to become an index value. Instead I would initialise i as your index once this line is found, and use a named boolean like found_header for the flag. A clear name can remove the need for comments, since if found_header is self-explanatory. Likewise, I think you should use line instead of l. You do use line to replace l later. l in particular can look like a one or an upper case letter i, so it's not clear.
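As a rough illustration of the naming advice, the header check might read like this with a found_header flag and continue. The function name data_lines and the sample strings are made up for the example and are not part of the original code.

```python
def data_lines(lines, header):
    """Yield only the lines that come after the header line.

    `data_lines` is a hypothetical helper used for illustration.
    """
    found_header = False
    for line in lines:
        if not found_header:
            # Still in the preamble: check for the header, then skip
            # this line either way.
            if line.lstrip().startswith(header):
                found_header = True
            continue
        yield line


# Made-up sample input: one preamble line, the header, two data lines.
sample = [
    "Detector Test",
    "   Energy         X         Y",
    "  1.000E+36  -599.000     5.000",
    "  1.000E+36  -599.000    15.000",
]
rows = list(data_lines(sample, "Energy         X         Y"))
print(rows)
```

Because the flag is only tested at the top of the loop, the rest of the loop body can assume it is looking at a data line.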

Also there's nothing wrong with doing line = line.split() since you don't need the original value of line after this part.
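For instance, splitting one of the sample data lines on whitespace gives a list where index 1 is x, 2 is y, 3 is z and 4 is the result, which is why the snippets here index line[1] through line[4]. The sample string is copied from the excerpt in the question; the unpacked names are only for illustration.

```python
line = "  1.000E+36  -599.000     5.000   -59.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00    "

# str.split() with no arguments splits on any run of whitespace and
# ignores leading/trailing whitespace, so no strip() is needed first.
line = line.split()

# The first five fields are energy, x, y, z and result.
energy, x, y, z, result = line[:5]
print(x, y, z, result)
```

Note that the fields stay strings. Comparing y values as strings only works because the file formats every number the same way.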

I'd move i += 1 out of the if/else, since it happens in both branches. You can do it at the start of the loop if you initialise i as 0. Once again, I'd use continue to save a level of nesting, like so:

# If this is the first line of data
i += 1
if i == 1:
    curY = line[2]
    coords.append((line[1], line[2], line[3], line[4]))
    continue

# If data has the same y value as previous
if curY == line[2]:
    coords.append((line[1], line[2], line[3], line[4]))
    continue

# New y value, dump existing data to file
fname = "Y={}.txt".format(curY)
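The "collect rows until the y value changes" logic is the same job that itertools.groupby from the standard library does for consecutive runs. This is an alternative technique, not what the code above uses, and the rows list is made-up sample data.

```python
from itertools import groupby

# Made-up (x, y, z, result) rows; y = "5.000" appears in two separate runs.
rows = [
    ("-599.000", "5.000", "-59.000", "0.00000E+00"),
    ("-599.000", "5.000", "-57.000", "0.00000E+00"),
    ("-599.000", "15.000", "-59.000", "0.00000E+00"),
    ("-597.000", "5.000", "-59.000", "0.00000E+00"),
]

# groupby yields one (key, iterator) pair per run of consecutive rows
# that share the same y value.
runs = [(y, list(group)) for y, group in groupby(rows, key=lambda row: row[1])]

for y, group in runs:
    print(y, len(group))
```

groupby only groups adjacent rows, so a y value that reappears later produces a second group. That fits data where a plane is split across the file, as long as each group is appended to the file for its y value.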


Also, append mode will create a new empty file if none exists, so you don't need separate branches for creating and appending. Just always open with 'a' and write your data. To decide whether the column header is needed, check whether the file exists beforehand and store the result as a boolean:

# Write the column header only if the file does not exist yet
new_file = not os.path.exists(fname)
with open(fname, 'a') as out:
    if new_file:
        out.write("X         Y         Z         Result     \n")
    for coord in coords:
        out.write("{:10}{:10}{:10}{:10}\n".format(*coord))
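To see the append-mode behaviour in isolation, the sketch below writes two batches to the same file in a temporary directory. The helper name append_rows and the sample rows are hypothetical; the header string and format spec are taken from the snippet above.

```python
import os
import tempfile


def append_rows(fname, coords):
    """Append rows to fname, writing the column header only once."""
    # Check before opening: open(..., 'a') creates the file if missing,
    # so the existence test has to happen first.
    new_file = not os.path.exists(fname)
    with open(fname, 'a') as out:
        if new_file:
            out.write("X         Y         Z         Result     \n")
        for coord in coords:
            out.write("{:10}{:10}{:10}{:10}\n".format(*coord))


tmp_dir = tempfile.mkdtemp()
fname = os.path.join(tmp_dir, "Y=5.000.txt")

# Two separate batches for the same y value, as happens when a plane's
# data is not contiguous in the input.
append_rows(fname, [("-599.000", "5.000", "-59.000", "0.0")])
append_rows(fname, [("-597.000", "5.000", "-59.000", "0.0")])

with open(fname) as f:
    written = f.read().splitlines()
print(len(written))
```

The file ends up with one header line followed by both rows, since the second call finds the file already present and skips the header.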


So here's how I'd put together the whole thing:

import os

header = "Energy         X         Y         Z     Result     Rel Error     Volume    Rslt * Vol"


def dump(coords, y):
    # Append the collected rows to the file for this y value.
    fname = "Y={}.txt".format(y)
    # Write the column header only if the file does not exist yet
    new_file = not os.path.exists(fname)
    with open(fname, 'a') as out:
        if new_file:
            out.write("X         Y         Z         Result     \n")
        for coord in coords:
            out.write("{:10}{:10}{:10}{:10}\n".format(*coord))


with open("meshtal", 'r') as f:
    found_header = False
    i = 0
    coords = []
    curY = None

    for line in f:
        if not found_header:
            if line.lstrip().startswith(header):
                print("found start")
                found_header = True
            continue

        line = line.split()
        i += 1

        # If this is the first line of data
        if i == 1:
            curY = line[2]
            coords.append((line[1], line[2], line[3], line[4]))
            continue

        # If data has the same y value as previous
        if curY == line[2]:
            coords.append((line[1], line[2], line[3], line[4]))
            continue

        # New y value, dump existing data to file
        dump(coords, curY)

        i = 1
        coords = []
        curY = line[2]
        coords.append((line[1], line[2], line[3], line[4]))

    # Dump the last group, which the loop itself never writes
    if coords:
        dump(coords, curY)
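One possible way to package the same idea as a function that can be tried on a small file, sketched with itertools.dropwhile and itertools.groupby rather than the index counter used above. The function name split_by_y, the out_dir parameter and the three-row sample file are assumptions for illustration; the real meshtal file may contain lines this sketch does not anticipate.

```python
import os
import tempfile
from itertools import dropwhile, groupby

HEADER = "Energy         X         Y         Z     Result     Rel Error     Volume    Rslt * Vol"


def split_by_y(src_path, out_dir):
    """Write one 'Y=<value>.txt' file per y value found in src_path."""
    with open(src_path) as src:
        # Skip the preamble up to the header line, then the header itself.
        lines = dropwhile(lambda line: not line.lstrip().startswith(HEADER), src)
        next(lines, None)
        # Split each non-blank data line into whitespace-separated fields.
        rows = (line.split() for line in lines if line.strip())
        # One group per run of consecutive rows with the same y (field 2).
        for y, group in groupby(rows, key=lambda fields: fields[2]):
            fname = os.path.join(out_dir, "Y={}.txt".format(y))
            new_file = not os.path.exists(fname)
            with open(fname, 'a') as out:
                if new_file:
                    out.write("X         Y         Z         Result     \n")
                for fields in group:
                    out.write("{:10}{:10}{:10}{:10}\n".format(*fields[1:5]))


# Build a tiny stand-in for the real file (assumed layout, three data rows).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "meshtal")
with open(src_path, 'w') as f:
    f.write(" Detector Test\n")
    f.write("   " + HEADER + "\n")
    f.write("  1.000E+36  -599.000     5.000   -59.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00\n")
    f.write("  1.000E+36  -599.000    15.000   -59.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00\n")
    f.write("  1.000E+36  -597.000     5.000   -59.000 0.00000E+00 0.00000E+00 4.00000E+01 0.00000E+00\n")

split_by_y(src_path, tmp_dir)
created = sorted(name for name in os.listdir(tmp_dir) if name.startswith("Y="))
print(created)
```

Each group is written as soon as it ends, so only one run of rows is held in memory at a time, and a y value that reappears later is appended to its existing file.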

Context

StackExchange Code Review Q#104340, answer score: 2
