
Gathering data from huge text files

Submitted by: @import:stackexchange-codereview

Problem

I have a text file composed of several consecutive tables. I need to get certain values from certain tables and save them in an output file. Every table has a header which contains a string that can be used to find specific tables. The size of these text files can vary from tens of MB to several GB. I have written the following script to do the job:

string = 'str'
index = 20
n = 2

in_file = open('file.txt')
out_file = open("out.txt", 'w')
current_line = 0

for i in range(-index,index+1):
    for j in range(-index,index+1):
        for line in in_file:
            if string in line:
                En = line.split().pop(4)
                for line in in_file:
                    current_line += 1
                    if current_line == 2*(n+1)+2:
                        x = line.split().pop(10)
                    elif current_line == 3*(n+1)+2:
                        y = line.split().pop(10)
                    elif current_line == 4*(n+1)+2:
                        z = line.split().pop(10)
                        current_line = 0
                        break
                print i, j, En, x, y, z
                data = "%d %d %s %s %s %s\n" % (i,j,En,x,y,z)
                out_file.write(data)
                break
in_file.close()
out_file.close()


The script reads the file line by line, searching for the specified string ('str' in this example). When found, it extracts a value from the line containing the string and continues reading the lines that form the data table itself. Since all the tables in the file have the same number of lines and columns, I've used the variable current_line to keep track of which line is being read and to specify which lines contain the data I need. The first two for-loops are just there to generate a pair of indexes that I need printed in the output file (in this case they are between -20 and 20).

The script works fine. But since I've been learning python by myself for about one month, and the files I have to handle can

Solution

All this nesting obfuscates your code pretty badly, so let's start by reducing it so it's easier to read. If you need to test something in your for loop, you should flip the test and use continue, like this:

for line in in_file:
    if not string in line:
        continue
    En = line.split().pop(4)


This saves you one block of nesting at least. You could also combine your two range calls with itertools.product. product performs a nested loop based on the two iterables it's passed. It essentially creates the exact nested loop you have with just one line:

for i, j in itertools.product(range(-index,index+1), range(-index,index+1)):
    for line in in_file:
        if not string in line:
            continue
        En = line.split().pop(4)
        for line in in_file:
            current_line += 1
            if current_line == 2*(n+1)+2:
                x = line.split().pop(10)
            elif current_line == 3*(n+1)+2:
                y = line.split().pop(10)
            elif current_line == 4*(n+1)+2:
                z = line.split().pop(10)
                current_line = 0
                break
        print i, j, En, x, y, z
        data = "%d %d %s %s %s %s\n" % (i,j,En,x,y,z)
        out_file.write(data)
        break
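
As a quick check of that claim, here is a minimal sketch comparing the explicit nested loops with a single product call. The bound of 1 is only an illustrative stand-in for the 20 used in the question, and repeat=2 is shorthand for passing the same range twice.

```python
import itertools

index = 1  # small illustrative bound; the question uses 20

# Pairs produced by the explicit nested loops.
nested = []
for i in range(-index, index + 1):
    for j in range(-index, index + 1):
        nested.append((i, j))

# The same pairs, in the same order, from one product call.
flat = list(itertools.product(range(-index, index + 1), repeat=2))

print(nested == flat)  # True
print(flat[:4])        # [(-1, -1), (-1, 0), (-1, 1), (0, -1)]
```

The rightmost iterable varies fastest, which matches j being the inner loop.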


It's certainly more readable now, though still confusing: why are you looping over in_file inside another loop over in_file? It seems redundant, and I think you have misunderstood how file reading works. When you have read the first ten lines of a file object and then start a new loop over that same object, you won't re-read the first ten lines; the new loop picks up where the previous one stopped. This means there's no reason to nest your inner for loop like this. As long as the first loop breaks once the header is found, this will iterate over the same values:

for i, j in itertools.product(range(-index,index+1), range(-index,index+1)):
    for line in in_file:
        if not string in line:
            continue
        En = line.split().pop(4)
        break
    for line in in_file:
        current_line += 1
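
To illustrate how a file object keeps its position between loops, here is a small sketch. io.StringIO stands in for an open text file, and the "HEADER" marker and line contents are made up for the example.

```python
import io

# io.StringIO stands in for an open text file here.
in_file = io.StringIO("a\nb\nHEADER\nc\nd\n")

for line in in_file:
    if "HEADER" in line:
        break  # stop right after the header line

# A new loop over the same object resumes after the header;
# it does not start again from the top of the file.
remaining = [line.strip() for line in in_file]
print(remaining)  # ['c', 'd']
```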


Now about that current_line: you don't actually need it. You could use enumerate instead to automatically count the iteration you're on. enumerate takes an iterable and yields both the value (here, line) and a running number, which you can use to replace current_line. enumerate starts counting from 0 by default, but you can pass an optional start value:

for current_line, line in enumerate(in_file, 1):


I think current_line is a confusing name, especially now that it's next to line. idx or i are what I'd use, personally (index is already taken by your range bound).
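
A minimal sketch of enumerate picking out the same row numbers as the question's arithmetic, assuming n = 2 as in the question. The table contents and the names wanted and picked are invented for the illustration.

```python
import io

n = 2  # same value as in the question
# A made-up table of 19 rows standing in for the lines after a header.
table = io.StringIO("".join("row {}\n".format(k) for k in range(1, 20)))

wanted = {2*(n+1)+2, 3*(n+1)+2, 4*(n+1)+2}  # rows 8, 11 and 14
picked = []
for idx, line in enumerate(table, 1):  # count from 1, not 0
    if idx in wanted:
        picked.append(line.strip())
    if idx == max(wanted):
        break  # nothing further is needed from this table
print(picked)  # ['row 8', 'row 11', 'row 14']
```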

Also, you're creating the same string twice, once to print and once to write. Just create data up front and print that same value; this avoids discrepancies creeping in between the two through typos. You can add the newline manually in the write call instead. You should also use the newer way of formatting, str.format, as it's type agnostic. Note that the template needs one {} per value, six in total:

data = "{} {} {} {} {} {}".format(i, j, En, x, y, z)
print(data)
out_file.write(data + '\n')
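
To show what "type agnostic" means in practice, here is a small sketch with made-up values. The name en and its contents are placeholders for a field read from a file, which always arrives as a string.

```python
i, j = -20, 3
en = "1.25E-03"  # hypothetical field; values read from a file are strings

# "{}" accepts any type, so ints and strings can be mixed freely.
data = "{} {} {}".format(i, j, en)
print(data)  # -20 3 1.25E-03

# "%d" insists on a number and raises if handed a string.
try:
    "%d" % en
except TypeError as exc:
    print("TypeError:", exc)
```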


Now let's see how this looks with these changes:

for i, j in itertools.product(range(-index,index+1), range(-index,index+1)):
    for line in in_file:
        if not string in line:
            continue
        En = line.split().pop(4)
        break
    for idx, line in enumerate(in_file, 1):
        if idx == 2*(n+1)+2:
            x = line.split().pop(10)
        elif idx == 3*(n+1)+2:
            y = line.split().pop(10)
        elif idx == 4*(n+1)+2:
            z = line.split().pop(10)
            break

    data = "{} {} {} {} {} {}".format(i, j, En, x, y, z)
    print(data)
    out_file.write(data + '\n')


Now, a note about using open for your files. It's actually better to use with, because it ensures your files are closed no matter what, even if errors occur during your script. It will unfortunately reintroduce nesting, but it is worth it.

with open("file.txt") as in_file, open("out.txt", "w") as out_file:
    for i, j in itertools.product(range(-index,index+1), range(-index,index+1)):
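
As a hedged illustration of the "closed no matter what" behaviour, the sketch below raises an error in the middle of a with block and then checks the file. The temporary path and the simulated ValueError are assumptions made for the demo, not part of the original script.

```python
import os
import tempfile

# Throwaway location so the demo does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

try:
    with open(path, "w") as out_file:
        out_file.write("partial\n")
        raise ValueError("simulated failure mid-write")
except ValueError:
    pass

# The with block closed the file on the way out, despite the error.
print(out_file.closed)  # True
```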


Your names could definitely use work too. x, y and z don't mean anything to me. What is n? Is it a constant? Does it change? string is likewise vague; something like key would make more sense. index is confusing too, as it's only used for specifying the range you loop over. Would boundary work, perhaps? The problem is that your names are so vague I can't even suggest better ones, because I don't know what these things do.

Also you should match Python naming conventions. You use snake_case for current_line which is good, but En looks like a class due to the capital. It should just be en, or better yet a descriptive name. Even i and j would benefit from descriptive names if it was possible.
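
Putting the suggestions together, one possible shape for the whole script is sketched below. This is an assumption-laden illustration rather than a definitive rewrite: the function name extract, the parameter names key and boundary, the helper fake_table and the sample data are all hypothetical, and list indexing replaces .pop() only because it reads more simply. Unlike the original, it stops quietly if the file runs out of tables.

```python
import io
import itertools


def extract(in_file, out_file, key, boundary, n):
    """Write one 'i j en x y z' row per table whose header contains key."""
    # Row numbers, counted from the line after the header, holding x, y, z.
    x_row, y_row, z_row = (k * (n + 1) + 2 for k in (2, 3, 4))
    span = range(-boundary, boundary + 1)
    for i, j in itertools.product(span, span):
        # Skip ahead to the next matching header.
        for line in in_file:
            if key in line:
                en = line.split()[4]
                break
        else:
            return  # no more tables in the file
        x = y = z = None
        for idx, line in enumerate(in_file, 1):
            if idx == x_row:
                x = line.split()[10]
            elif idx == y_row:
                y = line.split()[10]
            elif idx == z_row:
                z = line.split()[10]
                break
        out_file.write("{} {} {} {} {} {}\n".format(i, j, en, x, y, z))


def fake_table(tag, rows=14, cols=11):
    """Build a made-up table: a header line plus rows of dummy cells."""
    header = "table with key str {}\n".format(tag)
    body = "".join(
        " ".join("{}r{}c{}".format(tag, r, c) for c in range(cols)) + "\n"
        for r in range(1, rows + 1)
    )
    return header + body


# In-memory stand-ins for file.txt and out.txt.
source = io.StringIO(fake_table("E0") + fake_table("E1"))
result = io.StringIO()
extract(source, result, key="str", boundary=1, n=2)
print(result.getvalue())
```

With real files, the same function could be called inside with open("file.txt") as in_file, open("out.txt", "w") as out_file:, which keeps the nesting in one place.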

Context

StackExchange Code Review Q#107490, answer score: 6
