Python library for awk-like file manipulation
Problem
I recently published a library for advanced awk-like file manipulation in Python 3. The code can be found here and here is the documentation. It is also available for download from pip (pip install awk). I would like to know if the code is well designed and how it can be improved to enforce readability and code reuse. I would also like to know if efficiency can be improved, keeping in mind that it should be able to handle large files.

```
import re
from itertools import zip_longest
from collections import OrderedDict


class FileNotOpenException(Exception):
    pass


class FieldNotFoundException(Exception):
    pass


DEFAULT_FIELD_SEP = r'\s+'


def _DEFAULT_FIELD_FUNC(field_key, field):
    return field


def _DEFAULT_FIELD_FILTER(field_key, field):
    return True


def _DEFAULT_RECORD_FUNC(NR, record):
    return record


def _DEFAULT_RECORD_FILTER(NR, record):
    return True


class Record(object):

    def __init__(self):
        """Initialises a Record object"""
        self._field_dict = {}
        self._field_list = []
        self._key_list = []
        self._iterator = None

    def __getitem__(self, key):
        """Allows access to fields in the following forms:
        - record[2]       # column indices start from 0
        - record[4:7:2]   # same as above
        - record['$4']    # same as record[3]
        - record['mykey'] # columns are indexed based on header, if present
        """
        try:
            try:
                return self._field_dict[key]
            except (KeyError, TypeError):  # nonexisting key or slice, respectively
                return self._field_list[key]
        except IndexError:
            raise FieldNotFoundException('No field {} in record'.format(key))

    def __setitem__(self, key, val):
        """should never be done manually, better create a new record than modifying an existing one"""
        self._field_dict[key] = val
        self._key_list.append(key)
        self._field_list.append(val)

    def add(self, val):
        ...  # the listing is cut off here in the source
```
Solution
Concept
In many ways, the functionality of this library resembles that of the built-in csv module. The main difference is that here you split by regex rather than on a specific character. I think that the design would be improved by modelling your code after the csv module, for example by having a separate Reader and DictReader.

The fact that the Reader accepts a filename as input limits the applicability of this code. What if I want to parse data coming from a network stream? It can't be done without first writing to a temporary file.

The field numbering convention is very confusing in my opinion:

```
"""
  - record['$4'] # same as record[3]
"""
```

record['$0'] doesn't retrieve the original text as I would expect.

You should either give up the AWK-inspired '$4' notation (for which I don't see much value) or fully embrace the one-based column numbering (which does have some precedent in Python regular expressions).

The filter functions make the Parser do much more than parsing, violating the Single Responsibility Principle. In addition, the filtering makes it unclear how record numbering works, or what you mean by the "next" record. I think you would be better off dropping the feature, since Python's generator expressions offer much of the same functionality.
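As a rough illustration of the csv-style design suggested above, here is a minimal sketch. It assumes a hypothetical function named regex_reader (it is not part of the awk library) that takes any iterable of lines, the way csv.reader does, instead of a filename. Because it is a plain generator, the caller can filter records with an ordinary generator expression or comprehension rather than passing filter callbacks.

```python
import io
import re


def regex_reader(lines, field_sep=r'\s+'):
    """Yield each non-empty line as a list of fields split on a regex.

    Hypothetical helper for illustration; it is not part of the awk library.
    `lines` may be any iterable of strings: an open file, a wrapper around
    a network stream, a list, or an io.StringIO.
    """
    splitter = re.compile(field_sep)
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield splitter.split(line)


# Works on an in-memory stream just as it would on an open file.
stream = io.StringIO("alice 30 london\nbob 25 paris\n\ncarol 41 rome\n")

# Filtering is left to the caller instead of being built into the reader.
over_28 = [fields for fields in regex_reader(stream) if int(fields[1]) > 28]
print(over_28)  # [['alice', '30', 'london'], ['carol', '41', 'rome']]
```

Keeping the reader free of filtering also makes record numbering unambiguous: wrapping the call in enumerate() counts exactly the records the reader produced.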
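For reference, the regular-expression precedent for one-based numbering mentioned above looks like this: group 0 of a match is the entire matched text, and captured groups are numbered from 1. The sketch uses only the standard re module; the sample string and variable names are made up for illustration.

```python
import re

# Three whitespace-separated fields, each captured as its own group.
match = re.match(r'(\S+)\s+(\S+)\s+(\S+)', 'The quick brown fox')

whole = match.group(0)  # entire matched text, analogous to AWK's $0
first = match.group(1)  # first captured field, analogous to $1
third = match.group(3)  # third captured field, analogous to $3

print(whole)  # The quick brown
print(first)  # The
print(third)  # brown
```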
Iterators

Your iterator implementation is more complicated than necessary, and in fact wrong.

Here's how iterators should behave:

```
>>> words = 'The quick brown fox jumps over the lazy dog'.split()
>>> iter1 = iter(words)
>>> iter2 = iter(words)
>>> next(iter1)
'The'
>>> next(iter1)
'quick'
>>> next(iter1)
'brown'
>>> next(iter2)
'The'
```

However, if I ask for two iterators on the same Record, they actually interfere with each other:

```
>>> from awk import Reader
>>> with Reader('fox.txt') as reader:
...     record = next(reader)
...
>>> str(record)
'Record($1: The, $2: quick, $3: brown, $4: fox, $5: jumps, $6: over, $7: the, $8: lazy, $9: dog)'
>>> iter1 = iter(record)
>>> iter2 = iter(record)
>>> next(iter1)
('$1', 'The')
>>> next(iter1)
('$2', 'quick')
>>> next(iter1)
('$3', 'brown')
>>> next(iter2)
('$4', 'fox')
```

To support iteration, you didn't need to write a __next__ method; all you needed was this:

```
class Record:
    …
    def __iter__(self):
        """Return an iterator over the record's (key, value) pairs"""
        return ((key, self._field_dict[key]) for key in self._key_list)
```
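To check that a generator-based __iter__ really does give independent iterators, here is a self-contained sketch. MiniRecord is a stripped-down stand-in written for this example, not the library's Record class; only the attribute names _field_dict and _key_list and the '$1', '$2', ... key style are taken from the code and output shown above.

```python
class MiniRecord:
    """Stand-in for Record, reduced to what iteration needs."""

    def __init__(self, fields):
        self._key_list = []
        self._field_dict = {}
        # Number the fields '$1', '$2', ... starting from one.
        for number, value in enumerate(fields, start=1):
            key = '${}'.format(number)
            self._key_list.append(key)
            self._field_dict[key] = value

    def __iter__(self):
        """Return a fresh iterator over (key, value) pairs on every call."""
        return ((key, self._field_dict[key]) for key in self._key_list)


record = MiniRecord('The quick brown fox'.split())
iter1 = iter(record)
iter2 = iter(record)

next(iter1)         # ('$1', 'The')
next(iter1)         # ('$2', 'quick')
print(next(iter1))  # ('$3', 'brown')
print(next(iter2))  # ('$1', 'The'): iter2 starts from the beginning
```

Each call to iter() builds a new generator, so no position is stored on the record itself, and that shared state is what caused the interference shown earlier.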
Context
StackExchange Code Review Q#145994, answer score: 2