Replace one-liner sed/awk with python

Submitted by: @import:stackexchange-codereview··

Problem

I have some files that I want to process, and I know how to do it in sed/awk (for each one):

awk '{if (index($0,"#")!=1) {line++; if (line%3==1) {print $2,$3}}}' q.post  > q


or

grep -v "#" q.post | awk '{if (NR%3==1) {print $2,$3}}'


It's one line, and rather beautiful and clear.

Now, my main program is in Python (2.7). Calling sed/awk from Python is a bit tedious (I get some errors), and I'd rather do it in a nice Pythonic way.

So far I have:

import glob

pp_files = glob.glob("*gauss.post")
for pp in pp_files:
    ppf = open(pp)
    with open(pp[:pp.rfind(".post")] + "_clean.post", "w") as outfile:
        counter = 0
        temp = []
        for line in ppf.readlines():
            if not line.startswith("#"):
                temp.append(line)
        for line in temp:
            if counter % 3 == 0:
                outfile.write(" ".join(line.split()[1:3]) + '\n')
            counter += 1
    ppf.close()


Meh.

It works, but it's not beautiful. Is there a Pythonic way, preferably a clear one-liner (not ten nested list comprehensions), to replace awk and sed?

Thanks

Solution

First, you should add open(pp) to your with statement.
Always use with when you call open:
it guarantees that the file is closed, even if an error occurs.
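
To illustrate the point about errors, here is a minimal sketch. The temporary file and the simulated exception are assumptions made purely for the demonstration; they are not part of the original program.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".post")
os.close(fd)

try:
    with open(path, "w") as handle:
        handle.write("# header\n")
        # Simulate something going wrong in the middle of processing.
        raise ValueError("simulated failure mid-write")
except ValueError:
    pass

# The with block closed the file on the way out, despite the exception.
file_was_closed = handle.closed
os.remove(path)
```

With a bare open() and a later close(), the close() call would have been skipped when the exception was raised.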

But onto your code. You seem to dislike comprehensions. I don't really get why.
Take your code:

for line in ppf.readlines():
    if not line.startswith("#"):
        temp.append(line)


This can instead be:

[line for line in ppf if not line.startswith("#")]


I know which I find easier to read. But if you don't like it, fair enough.
After this I'd slice the list, since you want every third line.
To do this we can use the slice operator. Say you have the string abcdefghijk, but you only want every third character.
You'd write 'abcdefghijk'[::3], which gives 'adgj'.
This removes the need for counter, and so simplifies your code to:

for pp in pp_files:
    with open(pp) as ppf, open(pp[:pp.rfind(".post")] + "_clean.post", "w") as outfile:
        for line in [line for line in ppf if not line.startswith("#")][::3]:
            outfile.write(" ".join(line.split()[1:3]) + '\n')
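
As a quick check that the slice really does replace the counter, here is a small sketch on made-up sample lines. The data and the variable names (sample, kept, with_counter, with_slice) are illustrative assumptions, not taken from the real .post files.

```python
# Made-up sample lines standing in for the contents of a .post file.
sample = [
    "# comment\n",
    "1 0.10 0.20 x\n",
    "2 0.11 0.21 x\n",
    "# another comment\n",
    "3 0.12 0.22 x\n",
    "4 0.13 0.23 x\n",
    "5 0.14 0.24 x\n",
]

# Drop the comment lines first, as in the comprehension above.
kept = [line for line in sample if not line.startswith("#")]

# Counter version, as in the question.
with_counter = []
counter = 0
for line in kept:
    if counter % 3 == 0:
        with_counter.append(" ".join(line.split()[1:3]))
    counter += 1

# Slice version: kept[::3] takes items 0, 3, 6, ... of the filtered list.
with_slice = [" ".join(line.split()[1:3]) for line in kept[::3]]
```

Both lists hold the second and third fields of the first and fourth non-comment lines, which is what the awk command prints.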


But if your file is large, this reads all of it into one list and then copies a third of it into another.
That's wasteful. If you use a generator expression and itertools.islice instead, you get the same result,
but the lines are processed one at a time, so the program uses less memory.

from itertools import islice

for pp in pp_files:
    with open(pp) as ppf, open(pp[:pp.rfind(".post")] + "_clean.post", "w") as outfile:
        # Lazily skip comment lines, then take every third remaining line.
        for line in islice((line for line in ppf if not line.startswith("#")), 0, None, 3):
            outfile.write(" ".join(line.split()[1:3]) + '\n')
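
If you want something closer to the one-liner feel of awk at the call site, one option is to wrap the filtering in a small generator function. This is only a sketch: the name clean_lines and the sample data are hypothetical, and the function accepts any iterable of lines so it can be tried without touching real files.

```python
from itertools import islice


def clean_lines(lines):
    """Yield fields 2 and 3 of every third non-comment line."""
    # Generator expression: nothing is read until islice asks for it.
    data = (line for line in lines if not line.startswith("#"))
    for line in islice(data, 0, None, 3):
        yield " ".join(line.split()[1:3])


# Made-up sample lines standing in for an open file object.
sample = [
    "# comment\n",
    "1 0.10 0.20 x\n",
    "2 0.11 0.21 x\n",
    "3 0.12 0.22 x\n",
    "# another comment\n",
    "4 0.13 0.23 x\n",
]
result = list(clean_lines(sample))
```

Inside the with block, the loop body could then become a single call such as outfile.writelines(fields + "\n" for fields in clean_lines(ppf)), since an open file is itself an iterable of lines.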


Context

StackExchange Code Review Q#148547, answer score: 10
