
Let's speed that file sentence searching program

Submitted by: @import:stackexchange-codereview

Problem

Intro:

I've written a small Python program which looks for a given sentence in multiple subdirectories of a given path.

I'm looking for improvements regarding the speed of my script.

Code:

from os import walk
from os.path import join

def get_magik_files(base_path):
    """
    Yields each path from all the base_path subdirectories

    :param base_path: this is the base path from where we'll start looking after .magik files
    :return: yield full path of a .magik file
    """
    for dirpath, _, filenames in walk(base_path):
        for filename in [f for f in filenames if f.endswith(".magik")]:
            yield join(dirpath, filename)

def search_sentence_in_file(base_path, sentence):
    """
    Prints each file path, line and line content where sentence was found

    :param base_path: this is the base path from where we'll start looking after .magik files
    :param sentence: the sentence we're looking up for
    :return: print the file path, line number and line content where sentence was found
    """
    for each_magik_file in get_magik_files(base_path):
        with open(each_magik_file) as magik_file:
            for line_number, line in enumerate(magik_file):
                if sentence in line:
                    print('[# FILE PATH    #] {} ...\n'
                          '[# LINE NUMBER  #] At line  {}\n'
                          '[# LINE CONTENT #] Content: {}'.format(each_magik_file, line_number, line.strip()))
                    print('---------------------------------------------------------------------------------')

def main():
    basepath = r'some_path'
    sentence_to_search = 'some sentence'

    search_sentence_in_file(basepath, sentence_to_search)

if __name__ == '__main__':
    main()


Miscellaneous:

As you may have already figured out, the reason my program is so slow lies in search_sentence_in_file(base_path, sentence), where I need to open each file, read it line by line and look for a specific se

Solution

Yay, PEP 8

PEP 8 limits lines to 72 characters for docstrings and 79 for code. The rest seems fine.

Separation of concerns

search_sentence_in_file should search and return its results. Printing them is the duty of the caller.

I feel it is also wrongly named, as it searches for a sentence in several files. So at least add the missing s at the end of the name. And to make it even more reusable, why not pass it an iterable of file paths (like the get_magik_files generator)?

Genericity

Besides search_sentence_in_file accepting an iterable, you could make get_magik_files more generic by passing the required extension as a parameter. This will let you extend your script to search in various kinds of files.
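
To illustrate where this genericity could lead, here is a minimal sketch of a variant that accepts several extensions at once. The name get_files_by_extensions and its extensions parameter are made up for this example and are not part of the original script; the only thing it relies on is that os.path.splitext returns the extension with its leading dot.

```python
import os
import tempfile


def get_files_by_extensions(base_path, extensions=None):
    """Yield paths under base_path whose extension is in extensions.

    Hypothetical variant for illustration: extensions is an iterable
    of extensions, with or without the leading dot. None means that
    no filter is applied.
    """
    if extensions is not None:
        # os.path.splitext() keeps the leading dot ('.magik'),
        # so normalize every requested extension to that form.
        extensions = {
            ext if ext.startswith('.') else '.' + ext
            for ext in extensions
        }

    for dirpath, _, filenames in os.walk(base_path):
        for filename in filenames:
            if (extensions is None
                    or os.path.splitext(filename)[1] in extensions):
                yield os.path.join(dirpath, filename)


if __name__ == '__main__':
    # Small demo on a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        for name in ('a.magik', 'b.txt', 'c.py'):
            open(os.path.join(root, name), 'w').close()
        for path in get_files_by_extensions(root, ['magik', '.txt']):
            print(path)
```

Using a set keeps the membership test cheap even if you ask for many extensions.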

First rewrite

from os import walk
from os.path import join, splitext

def get_files(base_path, extension=None):
    """
    Yields each path from all the base_path subdirectories

    :param base_path: this is the base path from where the
                      function starts looking for relevant files
    :param extension: filter files using provided extension,
                      with or without its leading dot.
                      If None, no filter is applied.
    :return: yield full path of a requested file
    """
    if extension is None:
        def filter_files(filenames):
            yield from filenames
    else:
        # splitext() keeps the leading dot ('.magik'), so
        # normalize the requested extension to match.
        if not extension.startswith('.'):
            extension = '.' + extension

        def filter_files(filenames):
            for filename in filenames:
                if splitext(filename)[1] == extension:
                    yield filename

    for dirpath, _, filenames in walk(base_path):
        for filename in filter_files(filenames):
            yield join(dirpath, filename)

def search_sentence_in_files(files, sentence):
    """
    Yield each file path, line and line content where
    sentence was found.

    :param files: iterable of files to search the sentence into
    :param sentence: the sentence we're looking up for
    :return: yield the file path, line number and line
             content where sentence was found
    """
    for filepath in files:
        with open(filepath) as fp:
            for line_number, line in enumerate(fp):
                if sentence in line:
                    yield filepath, line_number, line.strip()

def main():
    basepath = r'some_path'
    sentence_to_search = 'some sentence'

    files = get_files(basepath, 'magik')
    results = search_sentence_in_files(files, sentence_to_search)
    for filepath, line, content in results:
        print('[# FILE PATH    #]', filepath, '...')
        print('[# LINE NUMBER  #] At line', line)
        print('[# LINE CONTENT #] Content:', content)
        print('-'*80)

if __name__ == '__main__':
    main()
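
Since search_sentence_in_files now yields its results instead of printing them, the caller can do something other than printing without touching the search code. As a rough sketch, here is a caller that counts matching lines per file. count_matches_per_file is a hypothetical helper invented for this example, and the generator is repeated only so that the snippet runs on its own.

```python
from collections import Counter
import os
import tempfile


def search_sentence_in_files(files, sentence):
    """Same generator as in the rewrite, repeated for self-containment."""
    for filepath in files:
        with open(filepath) as fp:
            for line_number, line in enumerate(fp):
                if sentence in line:
                    yield filepath, line_number, line.strip()


def count_matches_per_file(files, sentence):
    """Hypothetical caller: count matching lines instead of printing."""
    return Counter(
        filepath
        for filepath, _, _ in search_sentence_in_files(files, sentence)
    )


if __name__ == '__main__':
    # Small demo on a throwaway file.
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, 'demo.magik')
        with open(path, 'w') as fp:
            fp.write('some sentence\nnothing\nsome sentence again\n')
        # files can be any iterable of paths, here a plain list.
        for filepath, count in count_matches_per_file(
                [path], 'some sentence').items():
            print(filepath, count)
```

Note that files is a plain list here rather than the get_files generator: any iterable of paths will do.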


Reusability

Your script is hard to reuse for other purposes: different sentences, different kinds of files. Better to add a CLI using argparse. Provide sensible defaults for your current usage, but allow for customization at will.

```
from os import walk
from os.path import join, splitext
import argparse


def get_files(base_path, extension=None):
    """
    Yields each path from all the base_path subdirectories

    :param base_path: this is the base path from where the
                      function starts looking for relevant files
    :param extension: filter files using provided extension,
                      with or without its leading dot.
                      If None, no filter is applied.
    :return: yield full path of a requested file
    """
    if extension is None:
        def filter_files(filenames):
            yield from filenames
    else:
        # splitext() keeps the leading dot ('.magik'), so
        # normalize the requested extension to match.
        if not extension.startswith('.'):
            extension = '.' + extension

        def filter_files(filenames):
            for filename in filenames:
                if splitext(filename)[1] == extension:
                    yield filename

    for dirpath, _, filenames in walk(base_path):
        for filename in filter_files(filenames):
            yield join(dirpath, filename)


def search_sentence_in_files(files, sentence):
    """
    Yield each file path, line and line content where
    sentence was found.

    :param files: iterable of files to search the sentence into
    :param sentence: the sentence we're looking up for
    :return: yield the file path, line number and line
             content where sentence was found
    """
    for filepath in files:
        with open(filepath) as fp:
            for line_number, line in enumerate(fp):
                if sentence in line:
                    yield filepath, line_number, line.strip()


def main(files, sentence):
    results = search_sentence_in_files(files, sentence)
    for filepath, line, content in results:
        print('[# FILE PATH    #]', filepath, '...')
        print('[# LINE NUMBER  #] At line', line)
        print('[# LINE CONTENT #] Content:', content)
        print('-'*80)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Search text in files')
    parser.add_argument('sentence')
    parser.add_argument('-p', '--basepath',
                        help='folder in which files will be examined',
                        default=r'some folder')
    parser.add_argument('-e', '--extension',
                        help='extension of files to examine',
                        default='magik')
    args = parser.parse_args()

    files = get_files(args.basepath, args.extension)
    main(files, args.sentence)
```
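
If you want to try the command line without leaving Python, parse_args also accepts an explicit list of arguments. The sketch below assumes the parser definition is moved into a small helper, build_parser, which is a name made up for this example; the arguments themselves are the ones from the script above.

```python
import argparse


def build_parser():
    """Hypothetical helper wrapping the parser definition."""
    parser = argparse.ArgumentParser(description='Search text in files')
    parser.add_argument('sentence')
    parser.add_argument('-p', '--basepath',
                        help='folder in which files will be examined',
                        default=r'some folder')
    parser.add_argument('-e', '--extension',
                        help='extension of files to examine',
                        default='magik')
    return parser


if __name__ == '__main__':
    parser = build_parser()

    # Only the mandatory sentence: the defaults apply.
    args = parser.parse_args(['some sentence'])
    print(args.sentence, args.basepath, args.extension)

    # Override the folder and the extension.
    args = parser.parse_args(['some sentence', '-p', 'src', '-e', 'txt'])
    print(args.sentence, args.basepath, args.extension)
```

Keeping the parser in its own function also makes the defaults easy to check in a unit test.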

Context

StackExchange Code Review Q#150571, answer score: 12
