patternpythonMinor
CR2 raw image file parser in Python 3
Viewed 0 times
fileimageparserrawpythoncr2
Problem
I haven't used Python in a number of years, but recently started working on a project called Photoshell which is written in Python 3.
As I attempted to learn the "Pythonic way", I wrote a parser for Canon's CR2 raw image file format. I'm not sure how "pythonic" the parser is, so I'd love to get some feedback on how it could be improved that way. Also how the speed / caching could be improved when handling large binary blobs (extracted images, in this case). Finaly, I suspect that the file read/writing could also use some work.
```
import os
import struct
tags = {
'image_width': 0x0100,
'image_length': 0x0101,
'bits_per_sample': 0x0102,
'compression': 0x0103,
'make': 0x010f,
'model': 0x0110,
'strip_offset': 0x0111,
'orientation': 0x0112,
'strip_byte_counts': 0x0117,
'x_resolution': 0x011a,
'y_resolution': 0x011b,
'resolution_unit': 0x0128,
'datetime': 0x0132,
'exif': 0x8769,
'gps_data': 0x8825
}
# Mapping of tag types to format strings.
# Format strings that start with a * are too long to fit in the IFD entry and
# are actually a pointer to somewhere else in the file.
tag_types = {
0x1: 'B', # Unsigned char
0x2: '*s', # String (with ending 0)
0x3: 'H', # Unsigned short
0x4: 'L', # Unsigned long
0x5: '*Q', # Unsigned rational
0x6: 'b', # Signed char
0x7: '*p', # Byte sequence
0x8: 'h', # Signed short
0x9: 'l', # Signed long
0xA: '*q', # Signed rational
0xB: '*f', # Float (IEEE)
0xC: '*d', # Double (IEEE)
}
# Format info: http://lclevy.free.fr/cr2/
# The Cr2 class loads a CR2 file from disk. It is currently read-only.
class Cr2(object):
class Header(object):
def __init__(self, fhandle):
assert not fhandle.closed
fhandle.seek(0)
header_buffer = fhandle.read(16)
(endianness,) = struct.unpack_from('H', header_buffer)
if endianness == 0x4949:
# Intel
As I attempted to learn the "Pythonic way", I wrote a parser for Canon's CR2 raw image file format. I'm not sure how "pythonic" the parser is, so I'd love to get some feedback on how it could be improved that way. Also how the speed / caching could be improved when handling large binary blobs (extracted images, in this case). Finaly, I suspect that the file read/writing could also use some work.
```
import os
import struct
tags = {
'image_width': 0x0100,
'image_length': 0x0101,
'bits_per_sample': 0x0102,
'compression': 0x0103,
'make': 0x010f,
'model': 0x0110,
'strip_offset': 0x0111,
'orientation': 0x0112,
'strip_byte_counts': 0x0117,
'x_resolution': 0x011a,
'y_resolution': 0x011b,
'resolution_unit': 0x0128,
'datetime': 0x0132,
'exif': 0x8769,
'gps_data': 0x8825
}
# Mapping of tag types to format strings.
# Format strings that start with a * are too long to fit in the IFD entry and
# are actually a pointer to somewhere else in the file.
tag_types = {
0x1: 'B', # Unsigned char
0x2: '*s', # String (with ending 0)
0x3: 'H', # Unsigned short
0x4: 'L', # Unsigned long
0x5: '*Q', # Unsigned rational
0x6: 'b', # Signed char
0x7: '*p', # Byte sequence
0x8: 'h', # Signed short
0x9: 'l', # Signed long
0xA: '*q', # Signed rational
0xB: '*f', # Float (IEEE)
0xC: '*d', # Double (IEEE)
}
# Format info: http://lclevy.free.fr/cr2/
# The Cr2 class loads a CR2 file from disk. It is currently read-only.
class Cr2(object):
class Header(object):
def __init__(self, fhandle):
assert not fhandle.closed
fhandle.seek(0)
header_buffer = fhandle.read(16)
(endianness,) = struct.unpack_from('H', header_buffer)
if endianness == 0x4949:
# Intel
Solution
The comment
would be better in
I wouldn't do that, though; I'd get rid of the getters and just have attributes. If you really want getters in order to make things read-only, use properties.
Since you're requiring Python 3, remove
I would separate the embedded classes; sticking them together seems relatively pointless.
IMHO,
I personally don't think
Your
It doesn't seem like
I would use a dictionary to calculate
You use
This all gives:
In
which would be more clear as
In
but I don't see a reason to use a slice. In fact, I would do the even stricter
I would change
I would do a single read from
Use
Looking further, I got confused as to why you delayed it anyway;
You also don't check for insufficient read sizes, but I'll let that be for now.
Being non-lazy, you could make
```
_IfdEntryFields = namedtuple("IfdEntryFields", [
"image", "tag_id", "tag_type", "value_len", "raw_value", "value"
])
class IfdEntry(_IfdEntryFields):
__slots__ = ()
def __new__(cls, offset, image):
tag_id, tag_type_key, value_len = image.read_at('HHL', offset)
tag_type = tag_types[tag_type_key]
if tag_type.startswith('*'):
_, tag_type = tag_type
[raw_value] = image.read_at('L', offset + 8)
[value] = image.read_at('{}{}'.format(value_len, tag_type), raw_value)
if tag_type == "s":
value = value.rstrip(b'\0').decode("utf-8")
else:
[raw_value] = i
# Format info: http://lclevy.free.fr/cr2/
# The Cr2 class loads a CR2 file from disk. It is currently read-only.would be better in
Cr2's docstring. "It is currently read-only" doesn't mean much since everything is public. To mark things "private" the convention is to prefix them with underscores (eg. self._header).I wouldn't do that, though; I'd get rid of the getters and just have attributes. If you really want getters in order to make things read-only, use properties.
Since you're requiring Python 3, remove
(object) from the inheritance list; it was only required while transitioning from 2.x.I would separate the embedded classes; sticking them together seems relatively pointless.
IMHO,
[endianness] = ... is nicer than (endianness,) = .... Feel free to ignore this point.I personally don't think
Header should be calling seek on the input file; it should trust the user to give it an appropriately seeked file, even if that may be offset.Your
assert not fhandle.closed isn't needed; read will error just fine if it's closed (with a more descriptive error, too).It doesn't seem like
Header needs fhandle at all, actually; it should be able to receive just a string. FWIW, header_buffer isn't a buffer.I would use a dictionary to calculate
endian_flag.endian_flags = {
0x4949: '', # Motorola
}
# Default to native
endian_flag = endian_flags.get(endianness, "@")Header looks like it could be a namedtuple.You use
unpack_from a lot where unpack alone should work. Prefer the later.This all gives:
from collections import namedtuple
# Mapping from manufacturer to associated endianness as accepted by struct
endian_flags = {
0x4949: '', # Motorola
}
_HeaderFields = namedtuple("HeaderFields", [
"endian_flag", "raw_header", "tiff_magic_word", "tiff_offset",
"magic_word", "major_version", "minor_version", "raw_ifd_offset"
])
class Header(_HeaderFields):
__slots__ = ()
def __new__(cls, header_bytes):
[endianness] = struct.unpack_from('H', header_bytes)
# Default to native
endian_flag = endian_flags.get(endianness, "@")
raw_header = struct.unpack(endian_flag + 'HHLHBBL', header_bytes)
return super().__new__(cls, endian_flag, raw_header, *raw_header[1:])In
IfdEntry, you havetag_types[self.tag_type][0] == '*'which would be more clear as
tag_types[self.tag_type].startswith('*')In
get_value you dotag_fmt = tag_types[self.tag_type][1:]but I don't see a reason to use a slice. In fact, I would do the even stricter
_, tag_fmt = tag_types[self.tag_type]get_value would be better as a property.I would change
self.tag_type to be the result from after translating with tag_types.I would do a single read from
fhandle in __init__ and use unpack_from's offset parameter to read parts. This emphasised how repetitive the unpacking is, so I'd make a local function to reduce the repetition:def __init__(self, num, parent):
self.parent = parent
parent.fhandle.seek(parent.offset + 2 + (12 * num))
entry_bytes = parent.fhandle.read(12)
def unpack_at(type, offset):
return struct.unpack_from(parent.endian_flag + type, entry_bytes, offset)
self.tag_id, tag_type, self.value_len = unpack_at('HHL', 0)
self.tag_type = tag_types[tag_type]
if self.tag_type.startswith('*'):
[self.raw_value] = unpack_at('L', 8)
self._value = -1
else:
[self.raw_value] = unpack_at(self.tag_type, 8)
self._value = self.raw_valueUse
None instead of -1 as a "missing value" sentinel; that's what it's for, after all.Looking further, I got confused as to why you delayed it anyway;
value_len never seems to be big so it makes sense just to do it eagerly. This pointed out what seems to be a bug: self.parent.fhandle.read(self.value_len) is too few bytes when tag_fmt represents a value of > 1 bytes. This kind of error and repetition suggests that you should really have a class to encapsulate reading data from the file. Cr2 would work great for that.You also don't check for insufficient read sizes, but I'll let that be for now.
Being non-lazy, you could make
IfdEntry another immutable namedtuple:```
_IfdEntryFields = namedtuple("IfdEntryFields", [
"image", "tag_id", "tag_type", "value_len", "raw_value", "value"
])
class IfdEntry(_IfdEntryFields):
__slots__ = ()
def __new__(cls, offset, image):
tag_id, tag_type_key, value_len = image.read_at('HHL', offset)
tag_type = tag_types[tag_type_key]
if tag_type.startswith('*'):
_, tag_type = tag_type
[raw_value] = image.read_at('L', offset + 8)
[value] = image.read_at('{}{}'.format(value_len, tag_type), raw_value)
if tag_type == "s":
value = value.rstrip(b'\0').decode("utf-8")
else:
[raw_value] = i
Code Snippets
# Format info: http://lclevy.free.fr/cr2/
# The Cr2 class loads a CR2 file from disk. It is currently read-only.endian_flags = {
0x4949: '<', # Intel
0x4D4D: '>', # Motorola
}
# Default to native
endian_flag = endian_flags.get(endianness, "@")from collections import namedtuple
# Mapping from manufacturer to associated endianness as accepted by struct
endian_flags = {
0x4949: '<', # Intel
0x4D4D: '>', # Motorola
}
_HeaderFields = namedtuple("HeaderFields", [
"endian_flag", "raw_header", "tiff_magic_word", "tiff_offset",
"magic_word", "major_version", "minor_version", "raw_ifd_offset"
])
class Header(_HeaderFields):
__slots__ = ()
def __new__(cls, header_bytes):
[endianness] = struct.unpack_from('H', header_bytes)
# Default to native
endian_flag = endian_flags.get(endianness, "@")
raw_header = struct.unpack(endian_flag + 'HHLHBBL', header_bytes)
return super().__new__(cls, endian_flag, raw_header, *raw_header[1:])tag_types[self.tag_type][0] == '*'tag_types[self.tag_type].startswith('*')Context
StackExchange Code Review Q#75374, answer score: 3
Revisions (0)
No revisions yet.