Python library for awk-like file manipulation


Problem

I recently published a library for advanced awk-like file manipulation in Python 3. The code can be found here and here is the documentation. It is also available from PyPI (pip install awk). I would like to know whether the code is well designed and how it can be improved for readability and code reuse. I would also like to know whether efficiency can be improved, keeping in mind that it should be able to handle large files.

```
import re
from itertools import zip_longest
from collections import OrderedDict


class FileNotOpenException(Exception):
    pass


class FieldNotFoundException(Exception):
    pass


DEFAULT_FIELD_SEP = r'\s+'


def _DEFAULT_FIELD_FUNC(field_key, field):
    return field


def _DEFAULT_FIELD_FILTER(field_key, field):
    return True


def _DEFAULT_RECORD_FUNC(NR, record):
    return record


def _DEFAULT_RECORD_FILTER(NR, record):
    return True


class Record(object):

    def __init__(self):
        """Initialises a Record object"""
        self._field_dict = {}
        self._field_list = []
        self._key_list = []
        self._iterator = None

    def __getitem__(self, key):
        """Allows access to fields in the following forms:
        - record[2]  # column indices start from 0
        - record[4:7:2]  # same as above
        - record['$4']  # same as record[3]
        - record['mykey']  # columns are indexed based on header, if present
        """
        try:
            try:
                return self._field_dict[key]
            except (KeyError, TypeError):  # nonexisting key or slice, respectively
                return self._field_list[key]
        except IndexError:
            raise FieldNotFoundException('No field {} in record'.format(key))

    def __setitem__(self, key, val):
        """should never be done manually, better create a new record than modifying an existing one"""
        self._field_dict[key] = val
        self._key_list.append(key)
        self._field_list.append(val)

    def add(self, val):
        ...  # the listing is cut off at this point in the source
```

Solution

Concept

In many ways, the functionality of this library resembles that of the built-in csv module. The main difference is that here you split by regex rather than on a specific character. I think the design would be improved by modelling your code after the csv module, for example by providing separate Reader and DictReader classes.
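To make the suggestion concrete, here is a minimal sketch of what such a split might look like. The names RegexReader and RegexDictReader are hypothetical and are not part of the awk library; the sketch only mirrors the csv.reader / csv.DictReader division of labour and assumes the first line is a header.

```python
import re

DEFAULT_FIELD_SEP = r'\s+'


class RegexReader:
    """Hypothetical csv.reader analogue: yields each line as a list of fields."""

    def __init__(self, lines, field_sep=DEFAULT_FIELD_SEP):
        self._lines = iter(lines)
        self._sep = re.compile(field_sep)

    def __iter__(self):
        return self

    def __next__(self):
        line = next(self._lines)  # StopIteration propagates at end of input
        return self._sep.split(line.strip())


class RegexDictReader:
    """Hypothetical csv.DictReader analogue: the first line supplies the keys."""

    def __init__(self, lines, field_sep=DEFAULT_FIELD_SEP):
        self._reader = RegexReader(lines, field_sep)
        self.fieldnames = next(self._reader)

    def __iter__(self):
        return self

    def __next__(self):
        return dict(zip(self.fieldnames, next(self._reader)))


rows = list(RegexDictReader(["name  qty", "apple 3", "pear   12"]))
```

With this split, positional access and header-based access each get their own small class, so a single Record type no longer has to juggle both.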

The fact that the Reader accepts a filename as input limits the applicability of this code. What if I want to parse data coming from a network stream? It can't be done without first writing to a temporary file.
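One way around this, sketched below under the assumption that the parser only ever needs to iterate over lines, is to accept any iterable of strings. The function name iter_fields is made up for illustration; io.StringIO stands in for a network stream here, and an open file object would work the same way.

```python
import io
import re


def iter_fields(lines, field_sep=r'\s+'):
    """Yield the fields of each non-blank line from any iterable of strings."""
    splitter = re.compile(field_sep)
    for line in lines:
        line = line.strip()
        if line:
            yield splitter.split(line)


# Any line-oriented source works: an open file, a StringIO, a list of strings.
stream = io.StringIO("The quick brown\nfox jumps\n")
parsed = list(iter_fields(stream))
```

The caller then decides where the data comes from and remains responsible for opening and closing it.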

The field numbering convention is very confusing in my opinion:

```
"""
- record['$4']  # same as record[3]
"""
```

In AWK, $0 is the entire input record, but record['$0'] doesn't retrieve the original text as I would expect.

You should either give up the AWK-inspired '$4' notation (for which I don't see much value) or fully embrace one-based column numbering. The latter does have some precedent in Python regular expressions, where match.group(1) is the first capture group and match.group(0) is the whole match.
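For reference, this is how the re module numbers groups, followed by a rough sketch of a record that follows the same convention. OneBasedRecord is a hypothetical class written for this illustration, not something the library provides, and raising KeyError for unknown fields is an arbitrary choice.

```python
import re

# Precedent in the standard library: group 0 is the whole match,
# capture groups are numbered from 1.
match = re.match(r'(\w+)\s+(\w+)', 'The quick brown fox')
whole, first, second = match.group(0), match.group(1), match.group(2)


class OneBasedRecord:
    """Hypothetical record where '$0' is the raw text and '$1' the first field."""

    def __init__(self, text, field_sep=r'\s+'):
        self._text = text
        self._fields = re.split(field_sep, text.strip())

    def __getitem__(self, key):
        if not (isinstance(key, str) and key.startswith('$') and key[1:].isdecimal()):
            raise KeyError(key)
        number = int(key[1:])
        if number == 0:
            return self._text            # like AWK's $0
        if number > len(self._fields):
            raise KeyError(key)
        return self._fields[number - 1]  # one-based field access


rec = OneBasedRecord('The quick brown fox')
```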

The filter functions make the Parser do much more than parsing, violating the Single Responsibility Principle. In addition, the filtering makes it unclear how record numbering works, or what you mean by the "next" record. I think you would be better off dropping the feature, since Python's generator expressions offer much of the same functionality.
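As a rough illustration of that last point, filtering and transformation can live entirely outside the parser. The sample data below is made up, and plain re.split is used instead of the library's API. Because enumerate numbers the records before the filter runs, there is no doubt about which number a surviving record carries.

```python
import re

lines = ["alice 30", "bob 17", "carol 42"]

# Parsing is one step ...
records = (re.split(r'\s+', line.strip()) for line in lines)

# ... numbering is another (record numbers refer to the unfiltered input) ...
numbered = enumerate(records, start=1)

# ... and filtering plus transformation is an ordinary comprehension.
kept = [(nr, fields[0].upper()) for nr, fields in numbered if int(fields[1]) >= 18]
```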

Iterators

Your iterator implementation is more complicated than necessary, and in fact wrong.

Here's how iterators should behave:

```
>>> words = 'The quick brown fox jumps over the lazy dog'.split()
>>> iter1 = iter(words)
>>> iter2 = iter(words)
>>> next(iter1)
'The'
>>> next(iter1)
'quick'
>>> next(iter1)
'brown'
>>> next(iter2)
'The'
```

However, if I ask for two iterators on the same Record, they actually interfere with each other:

```
>>> from awk import Reader
>>> with Reader('fox.txt') as reader:
...     record = next(reader)
... 
>>> str(record)
'Record($1: The, $2: quick, $3: brown, $4: fox, $5: jumps, $6: over, $7: the, $8: lazy, $9: dog)'
>>> iter1 = iter(record)
>>> iter2 = iter(record)
>>> next(iter1)
('$1', 'The')
>>> next(iter1)
('$2', 'quick')
>>> next(iter1)
('$3', 'brown')
>>> next(iter2)
('$4', 'fox')
```

To support iteration, you didn't need to write a __next__ method; all you needed was this:

```
class Record:
    …

    def __iter__(self):
        """Return a fresh iterator over the record's (key, value) pairs"""
        return ((key, self._field_dict[key]) for key in self._key_list)
```
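To check that this behaves like the list example, here is a self-contained sketch. The Record class below is a stripped-down stand-in whose constructor is invented for the demonstration. Only the _key_list and _field_dict attributes and the __iter__ body come from the code under review.

```python
class Record:
    """Stand-in for the library's Record, reduced to what iteration needs."""

    def __init__(self, fields):
        # Hypothetical constructor: builds '$1', '$2', ... keys for the demo.
        self._key_list = ['${}'.format(i) for i in range(1, len(fields) + 1)]
        self._field_dict = dict(zip(self._key_list, fields))

    def __iter__(self):
        # Each call builds a new generator, so iterators share no state.
        return ((key, self._field_dict[key]) for key in self._key_list)


record = Record('The quick brown fox'.split())
iter1 = iter(record)
iter2 = iter(record)
first_three = [next(iter1) for _ in range(3)]
restart = next(iter2)  # starts from the first field, unaffected by iter1
```

Since no position is stored on the Record itself, the _iterator attribute and the __next__ method become unnecessary.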

Context

StackExchange Code Review Q#145994, answer score: 2
