Find files with content matching regex

Submitted by: @import:stackexchange-codereview

Problem

Today I wanted to find a program that I wrote a while ago. I knew that it contained a certain regex, but I couldn't for the life of me remember the file name I saved it under. I knew I could use Windows search, but that takes more time than it would take me to write a Python program to do the same.

The two main things I use are os.walk and re: the former to traverse the entire directory tree, the latter to match the data. I also use codecs so that I can read files with special characters. Finally, I use argparse to get the input from the end user.

Some files, such as PNGs or other raw data files, still raise errors when read with codecs, so I skip them.
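As a rough illustration of the traversal described above (a sketch, not the original script), the following walks a directory tree, reads each file through codecs, and skips anything that fails to decode. The function name read_text_files and the UTF-8 default are assumptions made for this example.

```python
import codecs
import os


def read_text_files(root_dir, encoding="utf-8"):
    """Yield (path, text) for every file under root_dir that decodes cleanly."""
    for root, _dirs, files in os.walk(root_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with codecs.open(path, "r", encoding=encoding) as handle:
                    text = handle.read()
            except (UnicodeDecodeError, IOError, OSError):
                # Binary or unreadable file (e.g. a PNG): skip it.
                continue
            yield path, text
```

Because it is a generator, files are read one at a time as the caller iterates, rather than all up front.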

I kept the arguments simple: you pass a regex and a path. You can also pass any of the regex flags. For example, the following searches for 'metaclass', ignoring case, in the files below 'D:\data'.

python search.py "metaclass" "D:\data" -i
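One way the command-line switches could be folded into a single flags value for re.compile is sketched below. This is an assumption about the approach, not the original code: the parser is trimmed down to two flags, and FLAG_NAMES and compile_from_args are hypothetical names introduced for the example.

```python
import argparse
import operator
import re
from functools import reduce

# Hypothetical, trimmed-down parser: only two of the re flags are exposed here.
parser = argparse.ArgumentParser(description="Search file contents.")
parser.add_argument("regex", help="regex to search for")
parser.add_argument("path", help="path to root of recursive search")
parser.add_argument("-i", "--ignorecase", action="store_true")
parser.add_argument("-m", "--multiline", action="store_true")

# Maps each argparse destination name to the re flag it switches on.
FLAG_NAMES = {"ignorecase": re.IGNORECASE, "multiline": re.MULTILINE}


def compile_from_args(argv):
    """Parse argv and return (compiled_regex, path)."""
    args = vars(parser.parse_args(argv))
    chosen = [flag for name, flag in FLAG_NAMES.items() if args[name]]
    # OR the selected flags together; 0 means "no flags".
    flags = reduce(operator.or_, chosen, 0)
    return re.compile(args["regex"], flags), args["path"]
```

With this sketch, compile_from_args(["metaclass", "D:\\data", "-i"]) would return a case-insensitive pattern together with the path.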


The code is fairly small and mostly just adds information to the parser. It also runs in both Python 2 and Python 3.

```
import re
import codecs
import argparse
import operator
from os import walk
from os.path import join
# Add reduce to global scope for Python3
try:
    from functools import reduce
except ImportError:
    pass

# Descriptions are the same as Python's re descriptions
# https://docs.python.org/2.7/library/re.html#module-contents
# https://docs.python.org/3.5/library/re.html#module-contents
parser = argparse.ArgumentParser(description='Search file contense.')
parser.add_argument('regex', help='regex to search for')
parser.add_argument('path', help='path to root of recursive search')
parser.add_argument('-a', '--ascii', action="store_true",
                    help='(Python3 only) Make `\w`, `\W`, `\b`, `\B`, `\d`, '
                         '`\D`, `\s` and `\S` perform ASCII-only matching '
                         'instead of full Unicode matching. This is only '
                         'meaningful for Unicode patterns, and is ignored for '
```

Solution

Your try block for importing reduce is unnecessary. In Python 2 (2.6 and later), reduce is available in the functools module as well as in the __builtin__ module, so a plain from functools import reduce works on both Python 2 and Python 3.

You have a typo in your description. It should be 'contents', not 'contense'.

Since ASCII is a Python 3-only flag, you might want to account for that in get_args(). It isn't complicated. Just add:

if args['ascii']:
    try:
        re.ASCII
    except AttributeError:
        parser.error("--ascii is compatible with Python 3 only")


I think get_args() is fine in how much it does. A regex such as th(kl is invalid, and invalid arguments should be caught in the function that gets the arguments. I would, however, add a function that determines whether a given regex is found in a file. That way get_files() could look like this:

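# Assumes `import os`; file_matches() is the proposed helper function.
def get_files(path, regex):
    return (name
        for root, dirs, files in os.walk(path)
            for name in files
                # Pass the full path so the helper can open the file.
                if file_matches(os.path.join(root, name), regex)
    )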
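The file_matches() helper is only named above, not defined. A minimal sketch of what it might look like follows, assuming it receives a file path and a compiled regex, and that undecodable files should simply count as non-matching. The encoding parameter is an assumption added for this example.

```python
import codecs


def file_matches(path, regex, encoding="utf-8"):
    """Return True if the compiled regex is found anywhere in the file at path."""
    try:
        with codecs.open(path, "r", encoding=encoding) as handle:
            return regex.search(handle.read()) is not None
    except (UnicodeDecodeError, IOError, OSError):
        # Undecodable (e.g. binary) or unreadable files count as non-matching.
        return False
```

Keeping the file handling in one small function leaves get_files() responsible only for walking the tree, which is what makes the generator expression above possible.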


As described in the Stack Overflow question "How do I re.search or re.match on a whole file without reading it all into memory?", you can use mmap.mmap to save on memory usage. Note that in Python 3 the pattern must be a bytes regex when it is applied to a memory-mapped file.
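A minimal sketch of the mmap approach on Python 3 is shown here. The function name file_matches_mmap is hypothetical, and treating an empty file as a non-match is an assumption made for the example (mmap refuses to map empty files).

```python
import mmap
import re


def file_matches_mmap(path, pattern):
    """Search a file for a bytes pattern via mmap instead of reading it whole."""
    regex = re.compile(pattern)  # pattern must be bytes, e.g. b"metaclass"
    with open(path, "rb") as handle:
        try:
            mapped = mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            # mmap cannot map an empty file; treat it as a non-match here.
            return False
        try:
            return regex.search(mapped) is not None
        finally:
            mapped.close()
```

Since the file is opened in binary mode, no decoding takes place, so the codecs error handling is not needed on this path. The trade-off is that the user's pattern has to be encoded to bytes before compiling.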

Context

StackExchange Code Review Q#139423, answer score: 6
