patternsqlMinor
Use case for hstore
Viewed 0 times
caseusehstorefor
Problem
This is the first time (I think) I have the chance to use the
Now, there's this web app in which we import payroll data in the form of XML files, which look roughly like this simplified version:
Each employee carries extremely detailed information, and when I say "several other hundreds of tags" it's because they will usually range between 800 and upwards of 1400. And this is for each month. Apart from a set of core tags, each employee can have a different combination of them, hence the very fluctuating but very realistic figures I gave above.
Now, part of this data is imported with a long, slow and very complex process, and I've observed that, with increasing frequency, we are finding ourselves saying "gosh, if only we had always imported that particular tag!".
While the import process is highly configurable, making it run for just a small set of data is slow, impractical and downright painful. Adding whatever customization is needed to import the hypothetical new tag from now on is much easier, but for building up historic data as if we always imported it, it's messy and error prone.
As an added bonus, the task always falls on those two guys, and I'm one of them, and I would love to make our life a tad simpler.
This is why I'm thinking of writing a quick tool that, during the night, will crack those XML files open and, for each month and each employee, create a record with an
Being an absolute beginner at
-
since the tags can differ for each employee, this is essentially schema-less data.
-
storing the tags as an EAV with one row per tag would mean approximately 240k rows per month f
hstore data type, but I would like to hear from more experienced folks if what I have in mind is actually a good idea.Now, there's this web app in which we import payroll data in the form of XML files, which look roughly like this simplified version:
Smith
John
9999.99
Each employee carries extremely detailed information, and when I say "several other hundreds of tags" it's because they will usually range between 800 and upwards of 1400. And this is for each month. Apart from a set of core tags, each employee can have a different combination of them, hence the very fluctuating but very realistic figures I gave above.
Now, part of this data is imported with a long, slow and very complex process, and I've observed that, with increasing frequency, we are finding ourselves saying "gosh, if only we had always imported that particular tag!".
While the import process is highly configurable, making it run for just a small set of data is slow, impractical and downright painful. Adding whatever customization is needed to import the hypothetical new tag from now on is much easier, but for building up historic data as if we always imported it, it's messy and error prone.
As an added bonus, the task always falls on those two guys, and I'm one of them, and I would love to make our life a tad simpler.
This is why I'm thinking of writing a quick tool that, during the night, will crack those XML files open and, for each month and each employee, create a record with an
hstore column containing all of the employee's tags for that month.Being an absolute beginner at
hstore, this looks to me like a very good use case for it, especially if we consider that:-
since the tags can differ for each employee, this is essentially schema-less data.
-
storing the tags as an EAV with one row per tag would mean approximately 240k rows per month f
Solution
At the scope at which you're working. I think JSONB is ideal. It handles deeply nested structures and structures with array-keys. It's also standardized and in the spec for sql2016.
In addition, as I answered here, there is an extension that should help you with space consumption called ZSON,
ZSON is a PostgreSQL extension for transparent JSONB compression. Compression is based on a shared dictionary of strings most frequently used in specific JSONB documents (not only keys, but also values, array elements, etc).
In some cases ZSON can save half of your disk space and give you about 10% more TPS. Memory is saved as well. See docs/benchmark.md. Everything depends on your data and workload, though. Don't believe any benchmarks, re-check everything on your data, configuration, hardware, workload and PostgreSQL version.
You may want to look into ZSON.
In addition, as I answered here, there is an extension that should help you with space consumption called ZSON,
ZSON is a PostgreSQL extension for transparent JSONB compression. Compression is based on a shared dictionary of strings most frequently used in specific JSONB documents (not only keys, but also values, array elements, etc).
In some cases ZSON can save half of your disk space and give you about 10% more TPS. Memory is saved as well. See docs/benchmark.md. Everything depends on your data and workload, though. Don't believe any benchmarks, re-check everything on your data, configuration, hardware, workload and PostgreSQL version.
You may want to look into ZSON.
Context
StackExchange Database Administrators Q#122950, answer score: 2
Revisions (0)
No revisions yet.