Apostle Galaxies: dict subclass with disk caching

Submitted by: @import:stackexchange-codereview

Problem

I am an astrophysicist working on large simulations as part of the APOSTLE project. The outputs of the simulations I use are large (TBs) and are stored in tables spread across multiple HDF5 files. Often, I'm interested in studying a particular simulated galaxy, which is represented by a collection of particles of different types, each with different properties.

Prior to writing this code, I would typically need to read tables corresponding to various properties for all particles in the simulation, then construct a selection to extract only those particles belonging to the galaxy of interest, copy those, and discard the large tables.

Given that I often study one or a few galaxies for a while, I thought it would be a good idea to build a little interface which computes selection "masks" for a given galaxy once, the first time it is run, and caches those to disk.

In addition, it caches any additional properties of the galaxy particles that get pulled out of the big tables, and some associated metadata. My solution is a subclass of dict, so once an ApostleObject (i.e. an abstraction of a galaxy) is initialized, call it AO, I can extract e.g. particle properties from the object by AO['T_g']. (In this example 'T_g' refers to the gas temperature. Another class, ApostleFileset, is aware of all available keys and how to get the raw tables for all particles from the disk.) If 'T_g' has been loaded before, it is already in memory and is simply returned; otherwise, if the key is valid, it is loaded from the master tables, and if it is not valid, a KeyError is raised.

There are two other features of note:

- I just implemented __getattr__ as an alternative to __getitem__ since writing AO.T_g is a bit more succinct and readable than AO['T_g'] (though __getitem__ is still useful in many contexts).

- This one has to do with how the caching is implemented - the dict subclass itself is actually buried inside a thin wrapper that is just a context manager. Th
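
The __getattr__ shortcut described in the first point can be sketched roughly as follows. This is a minimal illustration and not the original ApostleObject: the class name AttrDict and the sample temperature values are hypothetical.

```python
class AttrDict(dict):
    """Hypothetical dict subclass that also exposes keys as attributes."""

    def __getattr__(self, name):
        # __getattr__ is only called after normal attribute lookup has
        # failed, so real attributes and dict methods are unaffected.
        try:
            return self[name]
        except KeyError:
            # hasattr(), copy and pickle expect AttributeError here,
            # not KeyError.
            raise AttributeError(name)


galaxy = AttrDict(T_g=[1.0e4, 2.5e4])  # made-up gas temperatures
print(galaxy.T_g is galaxy['T_g'])
print(hasattr(galaxy, 'not_a_key'))
```

Translating the KeyError into an AttributeError matters because code that probes for optional attributes would otherwise see an unexpected exception type.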

Solution

From the look of your code, I'd say ApostleFileset and ApostleObject are both god classes.
First, _define_masks should be defined in ApostleFileset, and _load_key can almost be moved into it too.
The only problem with _load_key is that when the key is a particle and the first value is xyz, you mutate self rather than just assign to it.

Also, checking a condition with an if statement and then acting on it through a couple of separate function calls may lead to a TOCTOU (time-of-check to time-of-use) bug. Python 3 addressed this by adding the x mode to open, which fails if the file already exists.
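
As a rough sketch of the x mode (available from Python 3.3), the lock file can be created in a single exclusive step. The helper name acquire_lock and the temporary directory are assumptions made for this illustration only.

```python
import os
import tempfile


def acquire_lock(lock_path):
    """Create lock_path exclusively; return False if it already exists."""
    try:
        # 'x' means exclusive creation: the existence check and the
        # creation happen in one step, so there is no gap between them.
        with open(lock_path, 'x'):
            pass
    except FileExistsError:
        return False
    return True


lock = os.path.join(tempfile.mkdtemp(), 'cache.lock')
first = acquire_lock(lock)   # lock file is created
second = acquire_lock(lock)  # lock file already exists
os.remove(lock)              # release the lock
third = acquire_lock(lock)   # can be acquired again
os.remove(lock)
print(first, second, third)
```

On Python 2, where the x mode is not available, os.open with the os.O_CREAT | os.O_EXCL flags gives the same exclusive-creation behaviour.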

Since you don't want any more god classes, you should keep the input and output as simple as possible.
You want two things: the cache file location and a function to get missing values.
The class should then create a lock for the file and open the cache for you.
If you want any more functionality, subclass it and add it there.
But you should only really add saving to and reading from the cache.

This can get you:

import os

class Cache(dict):
    def __init__(self, provider):
        self._provider = provider

    def __missing__(self, key):
        v = self._provider(self, key)
        self[key] = v
        return v

class FileCache(Cache):
    def __init__(self, path, provider):
        super(FileCache, self).__init__(provider)
        self._path = path

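    def __enter__(self):
        path = self._path + '.lock'
        # Note: this check-then-create sequence is not atomic; on Python 3,
        # open(path, 'x') creates the lock file exclusively in one step.
        if not os.path.isfile(path):
            with open(path, 'a') as f:
                pass
        else:
            raise IOError('file {!r} exists.'.format(path))
        # 'a+' creates the cache file if needed and allows reading it back.
        self._file = open(self._path, 'a+')
        return self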

    def __exit__(self, exc_type, exc_value, traceback):
        self._file.close()
        os.remove(self._path + '.lock')
        return False
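
To show how __missing__ drives the provider, here is a small sketch that repeats the Cache class so that it runs on its own. The provider load_from_tables, the calls list and the returned values are hypothetical stand-ins for an expensive table read.

```python
class Cache(dict):
    def __init__(self, provider):
        super(Cache, self).__init__()
        self._provider = provider

    def __missing__(self, key):
        # Called by dict.__getitem__ only when the key is absent.
        value = self._provider(self, key)
        self[key] = value
        return value


calls = []


def load_from_tables(cache, key):
    # Hypothetical provider: records each call and rejects unknown keys.
    calls.append(key)
    if key != 'T_g':
        raise KeyError(key)
    return [1.0e4, 2.5e4]


cache = Cache(load_from_tables)
cache['T_g']  # first access calls the provider
cache['T_g']  # second access is served from memory
print(calls)

fresh = Cache(load_from_tables)
print(fresh.get('T_g'))  # get() bypasses __missing__
```

Note that only indexing with square brackets triggers __missing__; get() and the in operator do not, so they never load anything.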


You then subclass FileCache to suit however you want to change ApostleFileset.
Below is an example of how you could extend the class:

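class ApostleObject(FileCache):
    def __enter__(self):
        ret = super(ApostleObject, self).__enter__()
        # 'a+' may position the stream at the end of the file (it does in
        # Python 3), so rewind before reading the cached entries.
        self._file.seek(0, os.SEEK_SET)
        for line in self._file:
            key, value = line.split(' ', 1)
            self[int(key)] = int(value)
        return ret

    def __missing__(self, key):
        value = super(ApostleObject, self).__missing__(key)
        # Append the newly computed entry to the cache file.
        self._file.write('{key} {value}\n'.format(key=key, value=value))
        return value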

class ApostleFileset(object):
    def missing(self, cache, key):
        return 2 * key

with ApostleObject('cache', ApostleFileset().missing) as obj:
    print(obj[1])
    print(obj[1])
    print(obj[3])
    # Only to show the content of the file, not for actual use.
    obj._file.seek(0, os.SEEK_SET)
    print(obj._file.read())

with ApostleObject('cache', ApostleFileset().missing) as obj:
    print(obj[2])
    obj._file.seek(0, os.SEEK_SET)
    print(obj._file.read())


Which outputs:

2
2
6
1 2
3 6

4
1 2
3 6
2 4

Context

StackExchange Code Review Q#150445, answer score: 3
