Fuzzy grep for fuzzy bears in pure Python

Submitted by: @import:stackexchange-codereview

Problem

I am aware that there are plenty of Python modules that do this, but I wrote my own partly as a learning experience and partly because it has all the functionality I need and no more.

I'm writing a simple interpreter for a Forth-like language, and because my CLIs are of the highest quality,[citation needed] I need to make the entire source (docstrings especially, but the rest of it too) searchable on a whim from within the interpreter.

To do this, I cooked up a little script I'm quite pleased with, which finds a bunch of possible matches of varying relevance and returns them as a populous structure.

Its fuzziness is sometimes way too fuzzy, due to the extremely simplistic way in which it's implemented. Mess around with the constants and kwargs to see what you get. Docs or (its own) source code make good test material.

```
from __future__ import division
from string import punctuation as punc
from difflib import SequenceMatcher as seqmat

DEBUG = True

class Match():

    def __init__(self, line, line_no, match_type,
                 prectxt, postctxt, misc=None):
        (self.line, self.line_no,
            self.match_type, self.prectxt,
                self.postctxt, self.misc_data) = (line, line_no,
                                    match_type, prectxt, postctxt, misc)

        self.matchinfo = (self.line, self.line_no, self.match_type,
                          self.prectxt, self.postctxt, self.misc_data)

    def match(self): return self.matchinfo

    def misc(self): return self.misc_data

def fuzzy_files(needle, file_haystack, **kwargs):
    """fuzzy grep in files. turns kwargs in to fuzzy_files"""

    metamatches = {}

    for fname in file_haystack:

        fio = open(fname, "r")
        fct = fio.read()
        fio.close()

        metamatches[fname] = fuzzy_grep(needle, fct, **kwargs)

    return metamatches

def fuzzy_grep(needle, haystack,
    TOLERANCE_BASE = .3, CONTEXT_LINES = 2,
    PUNC_IS_JUNK = True, JUNK_FUNC = None,
    CONSI
```

Solution

Opening files

While it's a small nitpick, it was bothering me a little. On these three lines, you're opening a file, reading it and then closing it:

fio = open(fname, "r")
fct = fio.read()
fio.close()


While this is a small chunk of code, if an exception occurs between opening and closing the file (while the file is being read, for example), fio.close() is never reached and the file is not explicitly closed. To ensure that the file is always closed, use a context manager by writing a with statement. Your code above would become this:

with open(fname, "r") as fio:
    fct = fio.read()

# continue to do things with `fct`
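
To see the guarantee in action, here is a minimal sketch that raises an error inside the with block and then checks the file object afterwards. The helper name file_after_failed_read and the temporary file are made up for this illustration and are not part of the original script.

```python
import os
import tempfile


def file_after_failed_read(path):
    """Return the file object left behind when the with block raises."""
    fio = None
    try:
        with open(path, "r") as fio:
            fio.read()
            raise ValueError("simulated failure while the file is open")
    except ValueError:
        pass  # the with statement has already closed the file by this point
    return fio


# Write a throwaway file so the example is self-contained.
fd, tmp_path = tempfile.mkstemp()
with os.fdopen(fd, "w") as tmp:
    tmp.write("fuzzy bears\n")

leftover = file_after_failed_read(tmp_path)
print(leftover.closed)  # True: the file was closed despite the exception
os.remove(tmp_path)
```

With the open/read/close version, the same exception would skip the close() call entirely.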


If you need to support Python versions older than 2.5, where the with statement is unavailable (on 2.5 itself it requires from __future__ import with_statement), you'd have to fall back on try and finally. You'd end up with something looking like this:

fio = open(fname, "r")

try:
    fct = fio.read()
finally:
    fio.close()

# Do more stuff with `fct`
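
Applied to fuzzy_files from the question, the with version might look like the sketch below. The fuzzy_grep defined here is only a stand-in that does exact substring matching, since the real function is not fully shown; treat it and the temporary file as assumptions made for the sake of a runnable example.

```python
import os
import tempfile


def fuzzy_grep(needle, haystack, **kwargs):
    """Stand-in for the question's fuzzy_grep (assumption, not the real logic).

    Returns the lines of `haystack` that contain `needle` exactly.
    """
    return [line for line in haystack.splitlines() if needle in line]


def fuzzy_files(needle, file_haystack, **kwargs):
    """Run fuzzy_grep over each file, passing kwargs through to it."""
    metamatches = {}

    for fname in file_haystack:
        # The file is closed when this block exits, even if read() raises.
        with open(fname, "r") as fio:
            fct = fio.read()

        metamatches[fname] = fuzzy_grep(needle, fct, **kwargs)

    return metamatches


# Usage with a throwaway file.
fd, tmp_path = tempfile.mkstemp()
with os.fdopen(fd, "w") as tmp:
    tmp.write("a fuzzy bear\nno match here\n")

results = fuzzy_files("bear", [tmp_path])
print(results[tmp_path])  # ['a fuzzy bear']
os.remove(tmp_path)
```

The search call sits outside the with block on purpose: the file only needs to stay open for as long as it takes to read it.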


Style nitpicks

This assignment is particularly nasty:

(self.line, self.line_no,
    self.match_type, self.prectxt,
        self.postctxt, self.misc_data) = (line, line_no,
                            match_type, prectxt, postctxt, misc)


Is there a reason you need to assign these values like this? If you assign them separately and in a more readable manner, as I did below, the behaviour of your code should remain the same.

self.line = line
self.line_no = line_no
self.match_type = match_type
self.prectxt = prectxt
self.postctxt = postctxt
self.misc_data = misc
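
Put together, the class could look like the sketch below. The last attribute keeps the name misc_data from the question; calling it misc would shadow the misc() method on the instance. The docstring wording and the sample values passed to Match are made up for illustration.

```python
class Match(object):
    """One search hit plus the context around it (wording is an assumption)."""

    def __init__(self, line, line_no, match_type,
                 prectxt, postctxt, misc=None):
        self.line = line
        self.line_no = line_no
        self.match_type = match_type
        self.prectxt = prectxt
        self.postctxt = postctxt
        # Stored as misc_data so it does not shadow the misc() method.
        self.misc_data = misc

        self.matchinfo = (self.line, self.line_no, self.match_type,
                          self.prectxt, self.postctxt, self.misc_data)

    def match(self):
        return self.matchinfo

    def misc(self):
        return self.misc_data


# Example values below are made up for illustration.
hit = Match("def fuzzy_grep(", 61, "exact", ["", ""], ["", ""], misc=0.9)
print(hit.match())
print(hit.misc())
```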


I also found this chunk of code:

def fuzzy_grep(needle,            haystack,
    TOLERANCE_BASE   = .3,    CONTEXT_LINES = 2,
    PUNC_IS_JUNK     = True,  JUNK_FUNC     = None,
    CONSIDER_CASE    = False, ADJUST_BYLEN  = True,
    APPROX_THRESHOLD = .5
    ):


It's hard to write function definitions with that many arguments in Python, and as far as I can tell, there's no real "correct" way of writing these. I usually just write them out like this:

def fuzzy_grep(
    needle,
    haystack,
    TOLERANCE_BASE=0.3,
    CONTEXT_LINES=2,
    PUNC_IS_JUNK=True,
    JUNK_FUNC=None,
    CONSIDER_CASE=False,
    ADJUST_BYLEN=True,
    APPROX_THRESHOLD=0.5):
    ...


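In addition, if you want to align the parameter value assignments, you can do it like this (note that PEP 8 recommends against spaces around = for default values, so this is a matter of taste):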

def fuzzy_grep(
    needle,
    haystack,
    TOLERANCE_BASE   = 0.3,
    CONTEXT_LINES    = 2,
    PUNC_IS_JUNK     = True,
    JUNK_FUNC        = None,
    CONSIDER_CASE    = False,
    ADJUST_BYLEN     = True,
    APPROX_THRESHOLD = 0.5):
    ...


While it takes up more space, it's much easier to read and much cleaner overall.

There are a number of places where you're shortening your variable names when they don't need to be shortened. A few examples might be:

  • fname versus filename
  • idx versus index
  • hstk versus haystack


There are other examples. Shortened names like these only take away from the readability and maintainability of your code.

Other than that, I don't see much else that's too much of an issue.

Context

StackExchange Code Review Q#122215, answer score: 2
