patternpythonMajor
Python package for handwriting recognition
Viewed 0 times
recognitionpackagehandwritingforpython
Problem
I am currently writing my bachelor's thesis about on-line handwriting recognition. This is not OCR, because I have the information how a symbol is written as a list of pen trajectory coordinates (x, y).
I call this
However, it is the first time I wrote a Python package of which I hope that others might use it. I hope to get general feedback about this project.
The project is hosted at GitHub and has the following structure:
I has some nosetests (not enough, I am working on it).
One of the files in
setup.py
```
try:
from setuptools import setup
except ImportError:
from distutils.core import setup
config = {
'name': 'hwrt',
'version': '0.1.125',
'author': 'Martin Thoma',
'author_email': 'info@martin-thoma.de',
'packages': ['hwrt'],
'scripts': ['bin/backup.py', 'bin/view.py', 'bin/download.py',
'bin/test.py', 'bin/train.py', 'bin/analyze_data.py',
'bin/hwrt', 'bin/record.py'],
'package_data': {'hwrt': ['templates/', 'misc/']},
'url': 'https://github.com/MartinThoma/hwrt',
'license': 'MIT',
'description': 'Handwriting Recognition Tools',
'long_description': """A tookit for handwriting recognition. It was
developed as part of the bachelors thesis of Martin Thoma.""",
'install_requires': [
"argparse",
"theano",
"nose",
"natsort",
"PyYAML",
"matplotlib",
"shapely"
],
'keywords': ['HWRT', 'recognition', 'handwriting', 'on-line'],
'download_url': 'https://github.com/MartinThoma/hwrt',
'classifiers': ['Development Status :: 3 - Alpha',
I call this
hwrt - handwriting recognition toolkit. It has a documentation and a friend of mine got the "first steps" to work on his computer.However, it is the first time I wrote a Python package of which I hope that others might use it. I hope to get general feedback about this project.
The project is hosted at GitHub and has the following structure:
.
├── bin
├── dist
├── docs
├── hwrt
│ ├── misc
│ └── templates
└── tests
└── symbolsI has some nosetests (not enough, I am working on it).
One of the files in
bin is view.py. It allows users to take a look at the data they have previously downloaded (see "first steps" in my documentation).setup.py
```
try:
from setuptools import setup
except ImportError:
from distutils.core import setup
config = {
'name': 'hwrt',
'version': '0.1.125',
'author': 'Martin Thoma',
'author_email': 'info@martin-thoma.de',
'packages': ['hwrt'],
'scripts': ['bin/backup.py', 'bin/view.py', 'bin/download.py',
'bin/test.py', 'bin/train.py', 'bin/analyze_data.py',
'bin/hwrt', 'bin/record.py'],
'package_data': {'hwrt': ['templates/', 'misc/']},
'url': 'https://github.com/MartinThoma/hwrt',
'license': 'MIT',
'description': 'Handwriting Recognition Tools',
'long_description': """A tookit for handwriting recognition. It was
developed as part of the bachelors thesis of Martin Thoma.""",
'install_requires': [
"argparse",
"theano",
"nose",
"natsort",
"PyYAML",
"matplotlib",
"shapely"
],
'keywords': ['HWRT', 'recognition', 'handwriting', 'on-line'],
'download_url': 'https://github.com/MartinThoma/hwrt',
'classifiers': ['Development Status :: 3 - Alpha',
Solution
Overall I think this is good quality, if a little dense code, by which I
mean that since there is lots of functionality there, it's not the
easiest code to follow. One suggestion I have for that is more grouping
according to topics. So e.g. the utilities could just have a couple of
sections for files, string formatting, calling external programs an so
on.
Since you work with a lot of external files, I imagine some sort of
wrapper object to handle a project or so could work as well.
(Frankly, I may just submit pull requests now, that's easier, heh.)
Code
While running the tests, I've found that the import of
module /
nntoolkit can't be loaded and
don't what causes this, as you already have Travis CI setup, I'll see
if I can get to the root cause of that.
IMO
this point this is more of a reflex for me, if it works for you, than
sure, why not (although you already have at least one workaround, the
For speed you can use
In
used. I'd additionally want a global flag to disable colors and it
would be good to use a library for the formatting (I saw colorterm and
termcolor; there may be others).
For something like
You're already doing this in some places, so I'd suggest using
Instead of
In
e.g. something like
general extracting common subexpressions (even
lot of code, so I won't look for other examples here.
In general, if you have
remove that indentation; also returning early can remove lots of
indentation.
For
over more than a few elements you could consider using that.
Instead of using
variant, to use less memory when possible, i.e. when only iterating over
something with a
Packaging
The
corresponding tags/releases. If you start now and add e.g.
the first release (or so), that'd make it easier to refer to specific
version, i.e. from install scripts.
The
circumstances.
for installation, in that case this point is moot.
The keywords look good, except that I'd doubt that including
helps if the package is already named that way, and
doesn't need the hyphen(it does, see comments).
The classifiers are good; you could also list more specific versions of
Python 3.
The
least some lower bound (i.e. your currently installed packages or so)
would be useful to have. Of course you might not precisely know which
versions to go for, but for someone trying to get it running this would
be helpful nonetheless.
I don't exactly know the process for external requirements, but you
could just note somewhere that ImageMagick is a dependency.
You also fixed the PEP8 stuff, so now there's only a few too-long lines
left; you could also add the
wouldn't be able to check in without everything fixed; I use that for
library code at least. Same goes for
of those things (and add it to the Makefile or again as a pre-commit
hook).
Tests
Great! There is a bit of duplication, e.g.
implemented three times. If possible I'd move that into (yet another,
heh) utils package just to get it out of the way.
Using nosetests and the addition of the Makefile is nice as well.
Future ideas
Well, I like PostgreSQL, so I think this will come up at some point; if
you don't have a pressing reason to use MySQL exclusively, using a
database independent library wou
mean that since there is lots of functionality there, it's not the
easiest code to follow. One suggestion I have for that is more grouping
according to topics. So e.g. the utilities could just have a couple of
sections for files, string formatting, calling external programs an so
on.
Since you work with a lot of external files, I imagine some sort of
wrapper object to handle a project or so could work as well.
(Frankly, I may just submit pull requests now, that's easier, heh.)
Code
While running the tests, I've found that the import of
open fromfuture.builtins doesn't work (Python 2.7.9), since there is no suchmodule /
future_builtins doesn't have open either, which means thatnntoolkit can't be loaded and
serve.py:19 also throws an error. Idon't what causes this, as you already have Travis CI setup, I'll see
if I can get to the root cause of that.
IMO
pickle isn't the nicest format for longterm data files; however atthis point this is more of a reflex for me, if it works for you, than
sure, why not (although you already have at least one workaround, the
sys.modules part, so keep that in mind I think).For speed you can use
ujson as a drop-in replacement of json.In
data_analyzation_metrics.py:119 a raw escape string for color(?) isused. I'd additionally want a global flag to disable colors and it
would be good to use a library for the formatting (I saw colorterm and
termcolor; there may be others).
For something like
"%s" % str(x) the str isn't necessary.You're already doing this in some places, so I'd suggest using
with open("foo") as file: all the time (if possible).Instead of
self.__repr__() repr(self) looks cleaner.In
features.py:174 the factors 2 and 3 should be extracted,e.g. something like
draw_width = 3 if self.pen_down else 2 or so; ingeneral extracting common subexpressions (even
len(x)) can eliminate alot of code, so I won't look for other examples here.
In general, if you have
if foo: return, you don't need an else, justremove that indentation; also returning early can remove lots of
indentation.
For
HandwrittenData.py:208 the whole method could just be reduce to:def __eq__(self, other):
return isinstance(other, self.__class__) \
and self.__dict__ == other.__dict__euclidean_distance from preprocessing.py:30 is also defined asscipy.spatial.distance.euclidean, so if in some cases you run thatover more than a few elements you could consider using that.
preprocessing.py:497 could be neater withfor index, point in enumerate(pointlist):.which in selfcheck.py should already exist somewhere ...?Instead of using
zip you can also use itertools.izip, the generatorvariant, to use less memory when possible, i.e. when only iterating over
something with a
for loop instead of storing the resulting list.Packaging
The
setup.py has a version number, but the Git repository has nocorresponding tags/releases. If you start now and add e.g.
0.1.207 asthe first release (or so), that'd make it easier to refer to specific
version, i.e. from install scripts.
The
long_description has a newline, that might look weird in somecircumstances.
install_requires lists at least one package which is not in therequirements.txt, so I'd sync those, unless it's really only requiredfor installation, in that case this point is moot.
The keywords look good, except that I'd doubt that including
'HWRT'helps if the package is already named that way, and
'on-line' probablydoesn't need the hyphen(it does, see comments).
The classifiers are good; you could also list more specific versions of
Python 3.
The
requirements.txt has no version requirements. I'd say that atleast some lower bound (i.e. your currently installed packages or so)
would be useful to have. Of course you might not precisely know which
versions to go for, but for someone trying to get it running this would
be helpful nonetheless.
I don't exactly know the process for external requirements, but you
could just note somewhere that ImageMagick is a dependency.
You also fixed the PEP8 stuff, so now there's only a few too-long lines
left; you could also add the
pep8 as a pre-commit hook, so youwouldn't be able to check in without everything fixed; I use that for
library code at least. Same goes for
pylint; I'd maybe disable a fewof those things (and add it to the Makefile or again as a pre-commit
hook).
Tests
Great! There is a bit of duplication, e.g.
compare_pointlists isimplemented three times. If possible I'd move that into (yet another,
heh) utils package just to get it out of the way.
Using nosetests and the addition of the Makefile is nice as well.
Future ideas
Well, I like PostgreSQL, so I think this will come up at some point; if
you don't have a pressing reason to use MySQL exclusively, using a
database independent library wou
Code Snippets
def __eq__(self, other):
return isinstance(other, self.__class__) \
and self.__dict__ == other.__dict__Context
StackExchange Code Review Q#68527, answer score: 26
Revisions (0)
No revisions yet.