HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Printing lines in a file, filtering by the first column

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
filethecolumnprintingfirstlinesfiltering

Problem

I am setting up a project which involves a lot of data analysis, each data point represented by a line in a large text file. In order to test for certain types of data, I added a tag to each line (It needs to remain in one file, though). Now I want to conveniently choose which type of data I want to include, currently I have five types (positive, negative, neutral, objective, objective-OR-neutral). Coming from C, I wrote my parse(filename) function the following way:

# parsing.py
# flags
POS = 1
NEG = 2
NEU = 4
OBJ = 8
OON = 16

# 'flag transformation dictionary'
ftd = {u'positive': POS,
       u'negative': NEG,
       u'neutral': NEU,
       u'objective': OBJ,
       u'objective-OR-neutral': OON}

# Read flagged data from file, return only entries with the right flag.
# Each line is: 
def parse(f_loc, flags=0):
    for line in open(f_loc):
        flag, content = line.strip().split('\t')
        if ftd[flag] & flags:  # this wouldn't be as pretty with kwargs
            yield content


Using this utility function somewhere else would then look like this:

# parsing_test.py
from parsing import parse, POS, NEG, NEU, OBJ, OON

# print all data flagged as 'positive', 'negative', or 'neutral'
for entry in parse('file.txt', POS | NEG | NEU):
    print(entry)

# print all data, no matter the flag
for entry in parse('file.txt', POS | NEG | NEU | OBJ | OON):
    print(entry)


I am just starting this project and am very conscious about good style (especially intuitiveness, extendability, robustness, and speed), so I would appreciate any criticism.

Solution

For one thing, in a bitmasking expression it reads much more naturally if you put the "variable" on the left and the "constant mask" on the right. In the same way that we write if x == 0 instead of if 0 == x, we also prefer to write if x & mask rather than if mask & x.

def parse(f_loc, flags=0):
    for line in open(f_loc):
        flag, content = line.strip().split('\t')
        if flags & ftd.get(flag, 0):
            yield content


However, your bitflag identifiers POS, NEU, OON, etc., are pretty much unreadable, especially given that you already have English names for them right there in the source file! If I were you, I'd replace the entire function with simply

def parse(f_loc, flags):
    for line in open(f_loc):
        flag, content = line.strip().split('\t')
        if flag in flags:
            yield content


and then call it as

# print all data flagged as 'positive', 'negative', or 'neutral'
flags = ['positive', 'negative', 'neutral']
for entry in parse('file.txt', flags):
    print(entry)

# print all data, no matter the flag
flags = ['positive', 'negative', 'neutral', 'objective', 'objective-OR-neutral']
for entry in parse('file.txt', flags):
    print(entry)


At this point the code is simple enough that you don't even really need the parse function anymore!

If you're concerned about efficiency, and you have the power to change your input file format, consider switching to single-character flags and fixed-width fields so that you don't need to strip or split your lines.

def parse(f_loc, flags):
    for line in open(f_loc):
        if line[0] in flags:
            yield line[:1].rstrip()

# print all data flagged as 'positive', 'negative', or 'neutral'
for entry in parse('file.txt', "+-n"):
    print(entry)

# print all data, no matter the flag
for entry in parse('file.txt', "+-noO"):
    print(entry)


One last thing: IMHO

from parsing import parse
... parse('file.txt', flags) ...


is generally inferior to

import parsing
... parsing.parse('file.txt', flags) ...


This way, when I see the call site, my first question isn't "what is parse?" (could be a local variable, a function in this file, etc. etc.); instead, my first question is "what is parsing?" (which I generally know off the top of my head because it's the name of a module, and it's easy to remember all the modules I'm using). Plus, if I ever need to use any other functions out of parsing, they're all at my fingertips:

from parsing import parse, NEU  # change
... parse('file.txt', NEU) ...  # change


is inferior to

import parsing                  # no change needed
... parsing.parse('file.txt', parsing.NEU) ...  # change

Code Snippets

def parse(f_loc, flags=0):
    for line in open(f_loc):
        flag, content = line.strip().split('\t')
        if flags & ftd.get(flag, 0):
            yield content
def parse(f_loc, flags):
    for line in open(f_loc):
        flag, content = line.strip().split('\t')
        if flag in flags:
            yield content
# print all data flagged as 'positive', 'negative', or 'neutral'
flags = ['positive', 'negative', 'neutral']
for entry in parse('file.txt', flags):
    print(entry)

# print all data, no matter the flag
flags = ['positive', 'negative', 'neutral', 'objective', 'objective-OR-neutral']
for entry in parse('file.txt', flags):
    print(entry)
def parse(f_loc, flags):
    for line in open(f_loc):
        if line[0] in flags:
            yield line[:1].rstrip()

# print all data flagged as 'positive', 'negative', or 'neutral'
for entry in parse('file.txt', "+-n"):
    print(entry)

# print all data, no matter the flag
for entry in parse('file.txt', "+-noO"):
    print(entry)
from parsing import parse
... parse('file.txt', flags) ...

Context

StackExchange Code Review Q#87595, answer score: 4

Revisions (0)

No revisions yet.