Gathering data from huge text files
Problem
I have a text file composed of several consecutive tables. I need to get certain values from certain tables and save them in an output file. Every table has a header containing a string that can be used to find specific tables. The size of these text files can vary from tens of MB to several GB. I have written the following script to do the job:

string = 'str'
index = 20
n = 2
in_file = open('file.txt')
out_file = open("out.txt", 'w')
current_line = 0

for i in range(-index,index+1):
    for j in range(-index,index+1):
        for line in in_file:
            if string in line:
                En = line.split().pop(4)
                for line in in_file:
                    current_line += 1
                    if current_line == 2*(n+1)+2:
                        x = line.split().pop(10)
                    elif current_line == 3*(n+1)+2:
                        y = line.split().pop(10)
                    elif current_line == 4*(n+1)+2:
                        z = line.split().pop(10)
                        current_line = 0
                        break
                print i, j, En, x, y, z
                data = "%d %d %s %s %s %s\n" % (i,j,En,x,y,z)
                out_file.write(data)
                break
in_file.close()
out_file.close()

The script reads the file line by line, searching for the specified string ('str' in this example). When it is found, the script extracts a value from the line containing the string and continues reading the lines that form the data table itself. Since all the tables in the file have the same number of lines and columns, I've used the variable current_line to keep track of which line is being read and to specify which lines contain the data I need. The first two for-loops are just there to generate a pair of indexes that I need printed in the output file (in this case they range from -20 to 20).
The script works fine. But since I've been learning Python by myself for about one month, and the files I have to handle can
Solution
All this nesting obfuscates your code pretty badly, so let's start by reducing it so it's easier to read. If you need to test something in your for loop, you should flip the test and use continue, like this:

for line in in_file:
    if string not in line:
        continue
    En = line.split().pop(4)

This saves you one block of nesting at least. You could also combine your two range calls with
itertools.product. product performs a nested loop over the iterables it's passed, so it replaces the exact nested loop you have with a single line:

import itertools

for i, j in itertools.product(range(-index, index+1), range(-index, index+1)):
    for line in in_file:
        if string not in line:
            continue
        En = line.split().pop(4)
        for line in in_file:
            current_line += 1
            if current_line == 2*(n+1)+2:
                x = line.split().pop(10)
            elif current_line == 3*(n+1)+2:
                y = line.split().pop(10)
            elif current_line == 4*(n+1)+2:
                z = line.split().pop(10)
                current_line = 0
                break
        print i, j, En, x, y, z
        data = "%d %d %s %s %s %s\n" % (i,j,En,x,y,z)
        out_file.write(data)
        break

It's certainly more readable now, though still confusing: why are you looping over
in_file inside a loop that is already iterating over in_file? It seems redundant, and I think you have misunderstood how file reading works. When you have read the first ten lines of a file object and then start a new loop over that same object, you won't re-read the first ten lines; the new loop picks up where the previous one stopped. This means there's no reason to nest your inner for loop like this. As long as the first loop breaks once the header is found, this will iterate over the same values:

for i, j in itertools.product(range(-index, index+1), range(-index, index+1)):
    for line in in_file:
        if string not in line:
            continue
        En = line.split().pop(4)
        break  # header found; the next loop resumes on the following line
    for line in in_file:
        current_line += 1

Now about that
current_line: you don't actually need it. You could use enumerate instead to automatically count the iteration you're on. enumerate takes an iterable and returns both the value (i.e. line) and a number, which you can use to replace current_line. enumerate starts counting from 0 by default, but you can pass an optional start value:

for current_line, line in enumerate(in_file, 1):

I think
current_line is a confusing name, especially now that it's next to line. index or i are what I'd use, personally, although both are already taken in this script, so something like idx avoids a clash.

Also, you're creating the same string twice, once to print and once to write. Just create
data upfront so you can print and write the same value; this avoids typo discrepancies between the two. You can add the newline in the write call instead. You should also use the newer way of formatting, str.format, as it's type agnostic:

data = "{} {} {} {} {} {}".format(i, j, En, x, y, z)
print(data)
out_file.write(data + '\n')

Now let's see how this looks with these changes:

for i, j in itertools.product(range(-index, index+1), range(-index, index+1)):
    for line in in_file:
        if string not in line:
            continue
        En = line.split().pop(4)
        break
    for idx, line in enumerate(in_file, 1):
        if idx == 2*(n+1)+2:
            x = line.split().pop(10)
        elif idx == 3*(n+1)+2:
            y = line.split().pop(10)
        elif idx == 4*(n+1)+2:
            z = line.split().pop(10)
            break
    # six placeholders, one per value
    data = "{} {} {} {} {} {}".format(i, j, En, x, y, z)
    print(data)
    out_file.write(data + '\n')

Now, a note about using
open for your files. It's actually better to use with, because it ensures your files are closed no matter what, even if errors occur during your script. It will unfortunately reintroduce a level of nesting, but it is worth it.

with open("file.txt") as in_file, open("out.txt", "w") as out_file:
    for i, j in itertools.product(range(-index, index+1), range(-index, index+1)):
        # loop body as before

Your names could definitely use work too.
x, y and z don't mean anything to me. What is n? Is it a constant? Does it change? string is likewise vague; something like key would make more sense. index is confusing too, as it's only used for specifying the range you loop over. Would boundary work, perhaps? The problem is that your names are so vague I can't even suggest better ones, because I don't know what these things do.

Also, you should match Python naming conventions. You use snake_case for
current_line, which is good, but En looks like a class name because of the capital. It should just be en, or better yet a descriptive name. Even i and j would benefit from descriptive names, if possible.

Code Snippets
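As a consolidated illustration of the review's suggestions (the early continue, enumerate, a single formatted string, and descriptive names), here is a minimal Python 3 sketch that wraps the logic in a generator. The names extract_rows, make_table, key, boundary and energy are assumptions made for illustration, as is the synthetic table layout used to exercise it. The row offsets and the column positions (4 and 10) are taken from the original script.

```python
import io
import itertools


def extract_rows(in_file, key, n, boundary):
    """Yield one formatted row per table whose header line contains `key`.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Offsets (counted from the line after the header) of the x, y, z rows.
    x_row, y_row, z_row = (k * (n + 1) + 2 for k in (2, 3, 4))
    for i, j in itertools.product(range(-boundary, boundary + 1), repeat=2):
        # Scan forward to the next header; the file keeps its position
        # between loops, so already-read lines are never revisited.
        for line in in_file:
            if key not in line:
                continue
            energy = line.split()[4]
            break
        else:
            return  # assumption: stop once no further header exists
        x = y = z = None
        for idx, line in enumerate(in_file, 1):
            if idx == x_row:
                x = line.split()[10]
            elif idx == y_row:
                y = line.split()[10]
            elif idx == z_row:
                z = line.split()[10]
                break
        yield "{} {} {} {} {} {}".format(i, j, energy, x, y, z)


def make_table(energy, rows=14, cols=11):
    """Build a synthetic table (made-up layout, for demonstration only)."""
    header = "table str header energy {}\n".format(energy)
    body = "".join(
        " ".join("r{}c{}".format(r, c) for c in range(cols)) + "\n"
        for r in range(1, rows + 1)
    )
    return header + body


# Two in-memory tables stand in for the real input file.
sample = io.StringIO(make_table("1.5") + make_table("2.5"))
rows = list(extract_rows(sample, key="str", n=2, boundary=1))
for row in rows:
    print(row)
```

With real files, the same generator could be driven from inside a with block that opens "file.txt" for reading and "out.txt" for writing, writing row + "\n" for each yielded row. Because rows are produced one at a time, the input is never loaded into memory as a whole, which matters for files of several GB. Returning when no further header is found is a design choice of this sketch, not behavior of the original script.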

for line in in_file:
    if string not in line:
        continue
    En = line.split().pop(4)

for i, j in itertools.product(range(-index, index+1), range(-index, index+1)):
    for line in in_file:
        if string not in line:
            continue
        En = line.split().pop(4)
        for line in in_file:
            current_line += 1
            if current_line == 2*(n+1)+2:
                x = line.split().pop(10)
            elif current_line == 3*(n+1)+2:
                y = line.split().pop(10)
            elif current_line == 4*(n+1)+2:
                z = line.split().pop(10)
                current_line = 0
                break
        print i, j, En, x, y, z
        data = "%d %d %s %s %s %s\n" % (i,j,En,x,y,z)
        out_file.write(data)
        break

for i, j in itertools.product(range(-index, index+1), range(-index, index+1)):
    for line in in_file:
        if string not in line:
            continue
        En = line.split().pop(4)
        break  # header found; the next loop resumes on the following line
    for line in in_file:
        current_line += 1

for current_line, line in enumerate(in_file, 1):

data = "{} {} {} {} {} {}".format(i, j, En, x, y, z)
print(data)
out_file.write(data + '\n')

Context
StackExchange Code Review Q#107490, answer score: 6