patternpythonMinor
Printing lines in a file, filtering by the first column
Viewed 0 times
filethecolumnprintingfirstlinesfiltering
Problem
I am setting up a project which involves a lot of data analysis, each data point represented by a line in a large text file. In order to test for certain types of data, I added a tag to each line (It needs to remain in one file, though). Now I want to conveniently choose which type of data I want to include, currently I have five types (
Using this utility function somewhere else would then look like this:
I am just starting this project and am very conscious about good style (especially intuitiveness, extendability, robustness, and speed), so I would appreciate any criticism.
positive, negative, neutral, objective, objective-OR-neutral). Coming from C, I wrote my parse(filename) function the following way:# parsing.py
# flags
POS = 1
NEG = 2
NEU = 4
OBJ = 8
OON = 16
# 'flag transformation dictionary'
ftd = {u'positive': POS,
u'negative': NEG,
u'neutral': NEU,
u'objective': OBJ,
u'objective-OR-neutral': OON}
# Read flagged data from file, return only entries with the right flag.
# Each line is:
def parse(f_loc, flags=0):
for line in open(f_loc):
flag, content = line.strip().split('\t')
if ftd[flag] & flags: # this wouldn't be as pretty with kwargs
yield contentUsing this utility function somewhere else would then look like this:
# parsing_test.py
from parsing import parse, POS, NEG, NEU, OBJ, OON
# print all data flagged as 'positive', 'negative', or 'neutral'
for entry in parse('file.txt', POS | NEG | NEU):
print(entry)
# print all data, no matter the flag
for entry in parse('file.txt', POS | NEG | NEU | OBJ | OON):
print(entry)I am just starting this project and am very conscious about good style (especially intuitiveness, extendability, robustness, and speed), so I would appreciate any criticism.
Solution
For one thing, in a bitmasking expression it reads much more naturally if you put the "variable" on the left and the "constant mask" on the right. In the same way that we write
However, your bitflag identifiers
and then call it as
At this point the code is simple enough that you don't even really need the
If you're concerned about efficiency, and you have the power to change your input file format, consider switching to single-character flags and fixed-width fields so that you don't need to
One last thing: IMHO
is generally inferior to
This way, when I see the call site, my first question isn't "what is
is inferior to
if x == 0 instead of if 0 == x, we also prefer to write if x & mask rather than if mask & x.def parse(f_loc, flags=0):
for line in open(f_loc):
flag, content = line.strip().split('\t')
if flags & ftd.get(flag, 0):
yield contentHowever, your bitflag identifiers
POS, NEU, OON, etc., are pretty much unreadable, especially given that you already have English names for them right there in the source file! If I were you, I'd replace the entire function with simplydef parse(f_loc, flags):
for line in open(f_loc):
flag, content = line.strip().split('\t')
if flag in flags:
yield contentand then call it as
# print all data flagged as 'positive', 'negative', or 'neutral'
flags = ['positive', 'negative', 'neutral']
for entry in parse('file.txt', flags):
print(entry)
# print all data, no matter the flag
flags = ['positive', 'negative', 'neutral', 'objective', 'objective-OR-neutral']
for entry in parse('file.txt', flags):
print(entry)At this point the code is simple enough that you don't even really need the
parse function anymore!If you're concerned about efficiency, and you have the power to change your input file format, consider switching to single-character flags and fixed-width fields so that you don't need to
strip or split your lines.def parse(f_loc, flags):
for line in open(f_loc):
if line[0] in flags:
yield line[:1].rstrip()
# print all data flagged as 'positive', 'negative', or 'neutral'
for entry in parse('file.txt', "+-n"):
print(entry)
# print all data, no matter the flag
for entry in parse('file.txt', "+-noO"):
print(entry)One last thing: IMHO
from parsing import parse
... parse('file.txt', flags) ...is generally inferior to
import parsing
... parsing.parse('file.txt', flags) ...This way, when I see the call site, my first question isn't "what is
parse?" (could be a local variable, a function in this file, etc. etc.); instead, my first question is "what is parsing?" (which I generally know off the top of my head because it's the name of a module, and it's easy to remember all the modules I'm using). Plus, if I ever need to use any other functions out of parsing, they're all at my fingertips:from parsing import parse, NEU # change
... parse('file.txt', NEU) ... # changeis inferior to
import parsing # no change needed
... parsing.parse('file.txt', parsing.NEU) ... # changeCode Snippets
def parse(f_loc, flags=0):
for line in open(f_loc):
flag, content = line.strip().split('\t')
if flags & ftd.get(flag, 0):
yield contentdef parse(f_loc, flags):
for line in open(f_loc):
flag, content = line.strip().split('\t')
if flag in flags:
yield content# print all data flagged as 'positive', 'negative', or 'neutral'
flags = ['positive', 'negative', 'neutral']
for entry in parse('file.txt', flags):
print(entry)
# print all data, no matter the flag
flags = ['positive', 'negative', 'neutral', 'objective', 'objective-OR-neutral']
for entry in parse('file.txt', flags):
print(entry)def parse(f_loc, flags):
for line in open(f_loc):
if line[0] in flags:
yield line[:1].rstrip()
# print all data flagged as 'positive', 'negative', or 'neutral'
for entry in parse('file.txt', "+-n"):
print(entry)
# print all data, no matter the flag
for entry in parse('file.txt', "+-noO"):
print(entry)from parsing import parse
... parse('file.txt', flags) ...Context
StackExchange Code Review Q#87595, answer score: 4
Revisions (0)
No revisions yet.