HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMajor

Python package for handwriting recognition

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
recognitionpackagehandwritingforpython

Problem

I am currently writing my bachelor's thesis about on-line handwriting recognition. This is not OCR, because I have the information how a symbol is written as a list of pen trajectory coordinates (x, y).

I call this hwrt - handwriting recognition toolkit. It has a documentation and a friend of mine got the "first steps" to work on his computer.

However, it is the first time I wrote a Python package of which I hope that others might use it. I hope to get general feedback about this project.

The project is hosted at GitHub and has the following structure:

.
├── bin
├── dist
├── docs
├── hwrt
│   ├── misc
│   └── templates
└── tests
    └── symbols


I has some nosetests (not enough, I am working on it).

One of the files in bin is view.py. It allows users to take a look at the data they have previously downloaded (see "first steps" in my documentation).

setup.py

```
try:
from setuptools import setup
except ImportError:
from distutils.core import setup

config = {
'name': 'hwrt',
'version': '0.1.125',
'author': 'Martin Thoma',
'author_email': 'info@martin-thoma.de',
'packages': ['hwrt'],
'scripts': ['bin/backup.py', 'bin/view.py', 'bin/download.py',
'bin/test.py', 'bin/train.py', 'bin/analyze_data.py',
'bin/hwrt', 'bin/record.py'],
'package_data': {'hwrt': ['templates/', 'misc/']},
'url': 'https://github.com/MartinThoma/hwrt',
'license': 'MIT',
'description': 'Handwriting Recognition Tools',
'long_description': """A tookit for handwriting recognition. It was
developed as part of the bachelors thesis of Martin Thoma.""",
'install_requires': [
"argparse",
"theano",
"nose",
"natsort",
"PyYAML",
"matplotlib",
"shapely"
],
'keywords': ['HWRT', 'recognition', 'handwriting', 'on-line'],
'download_url': 'https://github.com/MartinThoma/hwrt',
'classifiers': ['Development Status :: 3 - Alpha',

Solution

Overall I think this is good quality, if a little dense code, by which I
mean that since there is lots of functionality there, it's not the
easiest code to follow. One suggestion I have for that is more grouping
according to topics. So e.g. the utilities could just have a couple of
sections for files, string formatting, calling external programs an so
on.

Since you work with a lot of external files, I imagine some sort of
wrapper object to handle a project or so could work as well.

(Frankly, I may just submit pull requests now, that's easier, heh.)

Code

While running the tests, I've found that the import of open from
future.builtins doesn't work (Python 2.7.9), since there is no such
module / future_builtins doesn't have open either, which means that
nntoolkit can't be loaded and serve.py:19 also throws an error. I
don't what causes this, as you already have Travis CI setup, I'll see
if I can get to the root cause of that.

IMO pickle isn't the nicest format for longterm data files; however at
this point this is more of a reflex for me, if it works for you, than
sure, why not (although you already have at least one workaround, the
sys.modules part, so keep that in mind I think).

For speed you can use ujson as a drop-in replacement of json.

In data_analyzation_metrics.py:119 a raw escape string for color(?) is
used. I'd additionally want a global flag to disable colors and it
would be good to use a library for the formatting (I saw colorterm and
termcolor; there may be others).

For something like "%s" % str(x) the str isn't necessary.

You're already doing this in some places, so I'd suggest using
with open("foo") as file: all the time (if possible).

Instead of self.__repr__() repr(self) looks cleaner.

In features.py:174 the factors 2 and 3 should be extracted,
e.g. something like draw_width = 3 if self.pen_down else 2 or so; in
general extracting common subexpressions (even len(x)) can eliminate a
lot of code, so I won't look for other examples here.

In general, if you have if foo: return, you don't need an else, just
remove that indentation; also returning early can remove lots of
indentation.

For HandwrittenData.py:208 the whole method could just be reduce to:

def __eq__(self, other):
    return isinstance(other, self.__class__) \
        and self.__dict__ == other.__dict__


euclidean_distance from preprocessing.py:30 is also defined as
scipy.spatial.distance.euclidean, so if in some cases you run that
over more than a few elements you could consider using that.

preprocessing.py:497 could be neater with
for index, point in enumerate(pointlist):.

which in selfcheck.py should already exist somewhere ...?

Instead of using zip you can also use itertools.izip, the generator
variant, to use less memory when possible, i.e. when only iterating over
something with a for loop instead of storing the resulting list.

Packaging

The setup.py has a version number, but the Git repository has no
corresponding tags/releases. If you start now and add e.g. 0.1.207 as
the first release (or so), that'd make it easier to refer to specific
version, i.e. from install scripts.

The long_description has a newline, that might look weird in some
circumstances.

install_requires lists at least one package which is not in the
requirements.txt, so I'd sync those, unless it's really only required
for installation, in that case this point is moot.

The keywords look good, except that I'd doubt that including 'HWRT'
helps if the package is already named that way, and 'on-line' probably
doesn't need the hyphen(it does, see comments).

The classifiers are good; you could also list more specific versions of
Python 3.

The requirements.txt has no version requirements. I'd say that at
least some lower bound (i.e. your currently installed packages or so)
would be useful to have. Of course you might not precisely know which
versions to go for, but for someone trying to get it running this would
be helpful nonetheless.

I don't exactly know the process for external requirements, but you
could just note somewhere that ImageMagick is a dependency.

You also fixed the PEP8 stuff, so now there's only a few too-long lines
left; you could also add the pep8 as a pre-commit hook, so you
wouldn't be able to check in without everything fixed; I use that for
library code at least. Same goes for pylint; I'd maybe disable a few
of those things (and add it to the Makefile or again as a pre-commit
hook).

Tests

Great! There is a bit of duplication, e.g. compare_pointlists is
implemented three times. If possible I'd move that into (yet another,
heh) utils package just to get it out of the way.

Using nosetests and the addition of the Makefile is nice as well.

Future ideas

Well, I like PostgreSQL, so I think this will come up at some point; if
you don't have a pressing reason to use MySQL exclusively, using a
database independent library wou

Code Snippets

def __eq__(self, other):
    return isinstance(other, self.__class__) \
        and self.__dict__ == other.__dict__

Context

StackExchange Code Review Q#68527, answer score: 26

Revisions (0)

No revisions yet.