Is it poor practice to aggregate data from different tables into one?
Problem
Background
I write a lot of large reports for, and generally maintain, a large health records DB (write SPs, functions, jobs, etc.). The original schema, and the software that uses it, is from a different vendor, so I can't change much about it structurally. There are many records that require tracking, such as labs, procedures, vaccines, etc., and they are scattered across dozens of tables, many of which are bloated and poorly indexed (I have been able to fix this somewhat).
The Problem
The issue is that because we have little control over the DB, and it can change with any given update or patch, writing and maintaining these reports is difficult and tedious - especially when there's a large amount of overlap. All it takes is one patch and I'm stuck rewriting large portions of a dozen reports. Additionally, the queries quickly become obfuscated and slow as the joins, nested selects, and applies pile up.
My "Solution"
My plan was to write all of these records to one "catch-all" table, and write triggers on the original tables to maintain the records in this aggregate table. Of course I'd need to ensure my triggers were intact after updates, but this would be much easier from a maintainability standpoint, and would make simply referencing the data easier as well.
The table would be thin and long, storing only the data required, something like this:
CREATE TABLE dbo.HCM_Event_Log (
    id                 INT IDENTITY,
    type_id            INT NULL,
    orig_id            VARCHAR(36) NULL,
    patient_id         UNIQUEIDENTIFIER NOT NULL,
    visit_id           UNIQUEIDENTIFIER NULL,
    lookup_id          VARCHAR(50) NULL,
    status             VARCHAR(15) NULL,
    ordered_datetime   DATETIME NULL,
    completed_datetime DATETIME NULL,
    CONSTRAINT PK_HCM_Event_Log PRIMARY KEY CLUSTERED (id)
);
Then I'd have various relational tables for things like the type_id and item groupings.
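For illustration, such a lookup table could be as simple as the following (the table and column names here are my assumptions, not part of the original schema):

```sql
-- Hypothetical lookup table for the event types referenced by type_id.
CREATE TABLE dbo.HCM_Event_Type (
    type_id   INT IDENTITY CONSTRAINT PK_HCM_Event_Type PRIMARY KEY CLUSTERED,
    type_name VARCHAR(50) NOT NULL,   -- e.g. 'Lab', 'Procedure', 'Vaccine'
    CONSTRAINT UQ_HCM_Event_Type_name UNIQUE (type_name)
);

-- The event log can then reference it:
ALTER TABLE dbo.HCM_Event_Log
    ADD CONSTRAINT FK_HCM_Event_Log_type
    FOREIGN KEY (type_id) REFERENCES dbo.HCM_Event_Type (type_id);
```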
I'm starting to second-guess this idea, as several of these tables are written to quite heavily, and the SPs and reports I'd be writing would reference the data frequently as well.
Solution
If I understood you correctly:
- you have a large third-party system,
- you don't have much control over it,
- you make complex reports that read data directly from this third-party database,
- your queries depend on the internal structure of the third-party database.
I would approach it like this:
- Set up my own separate database, which I have full control of.
- Set up a sync process that reads data from relevant tables and columns from the third-party database and inserts/updates into mine.
- Develop my complex reports based on the stable structure of my database.
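As a rough sketch of the sync step, assuming a hypothetical vendor table `VendorDB.dbo.LabResults` and a matching table in the reporting database (all names are illustrative), a scheduled job could run something like:

```sql
-- Hypothetical sync step: pull rows from the vendor database into the
-- reporting database. Table and column names are assumptions.
-- (NULL-safe comparison of nullable columns is glossed over for brevity.)
MERGE ReportDB.dbo.Lab_Result AS tgt
USING (
    SELECT lab_id, patient_id, result_value, completed_datetime
    FROM VendorDB.dbo.LabResults
) AS src
    ON tgt.lab_id = src.lab_id
WHEN MATCHED AND (tgt.result_value <> src.result_value
               OR tgt.completed_datetime <> src.completed_datetime) THEN
    UPDATE SET result_value       = src.result_value,
               completed_datetime = src.completed_datetime
WHEN NOT MATCHED BY TARGET THEN
    INSERT (lab_id, patient_id, result_value, completed_datetime)
    VALUES (src.lab_id, src.patient_id, src.result_value, src.completed_datetime);
```

Only this statement has to change if the vendor renames or restructures `LabResults`; the reports built on `ReportDB.dbo.Lab_Result` stay untouched.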
In this case you can fine-tune the structure and indexes of your database to improve the performance of your reports, without affecting the third-party system.
Unless the original data structure changes dramatically, the logic of your report queries would not change when the third-party database changes. You would have to adjust only the sync process.
The sync process is effectively a conversion process - you convert data from the third-party database into the structure that you need. Part of this conversion could be fixing any normalization problems that the original third-party database may have. Only this part of the system has to know about and depend on the internal structure of the third-party system. Your main reports and queries would depend only on your database.
So, the main point is: separate and limit the part of your system that depends on the internals of the third-party system.
Update
Regarding the real-time requirement: BTW, I always thought that the definition of "real-time" is "guaranteed response time", not "some small response time". It depends on your application, of course. In my practice it is enough if I sync two databases within a minute of a detected change. If a user sees a report on screen and some underlying data changes, the report has to be re-run somehow to reflect the change. You can poll for changes or listen for an event/message, but either way the report query has to be executed again to show the latest changes.
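One simple way to poll, assuming the vendor table carries a `ROWVERSION` column (the table and column names below are my assumptions), is to keep a watermark and fetch only rows changed since the last run:

```sql
-- Hypothetical polling sketch. Assumes VendorDB.dbo.LabResults has a
-- rv ROWVERSION column and that ReportDB.dbo.Sync_Watermark stores the
-- high-water mark from the previous run.
DECLARE @last BINARY(8) =
    (SELECT last_rv
     FROM ReportDB.dbo.Sync_Watermark
     WHERE table_name = 'LabResults');

SELECT lab_id, patient_id, result_value, completed_datetime, rv
FROM VendorDB.dbo.LabResults
WHERE rv > @last;   -- only what changed since the previous poll

-- After processing the rows, advance the watermark:
UPDATE ReportDB.dbo.Sync_Watermark
SET last_rv = (SELECT MAX(rv) FROM VendorDB.dbo.LabResults)
WHERE table_name = 'LabResults';
```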
You already intend to write triggers to capture changes in the original tables and write these changes to one generic table. So, capture changes as you intended, but write them to properly normalized tables, not a single one.
So, this is the extreme case: conversion of the third-party data structure into your internal data structure is performed in triggers that fire on INSERT/UPDATE/DELETE of the third-party tables. It can be tricky. The code of the triggers would depend on the internal structure of both systems. If the conversion is non-trivial, it may delay the original INSERT/UPDATE/DELETE to the point of failure. If there is a bug in your trigger, it may affect the original transaction to the point of failure. If the third-party system changes, it may break your trigger, which would cause transactions of the third-party system to fail.
The less extreme case: to make the code of your triggers simpler and less prone to errors, write all captured changes to some staging/audit/diff table(s), set a flag or send a message that there are changes pending, and launch the main conversion process, which would go through these intermediary tables and perform the conversion. The main thing here is that the potentially heavy conversion process should happen outside the scope of the original transaction.
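A minimal sketch of this less extreme variant (all object names here are assumptions): the trigger only records which rows changed, and a separate job does the conversion outside the vendor's transaction:

```sql
-- Thin diff table in the reporting database:
CREATE TABLE ReportDB.dbo.Lab_Result_Diff (
    diff_id   INT IDENTITY CONSTRAINT PK_Lab_Result_Diff PRIMARY KEY,
    lab_id    VARCHAR(36) NOT NULL,
    action    CHAR(1)     NOT NULL,              -- 'U' = insert/update, 'D' = delete
    captured  DATETIME    NOT NULL DEFAULT GETDATE(),
    processed BIT         NOT NULL DEFAULT 0
);
GO

-- Trigger (created in the vendor database) does the bare minimum
-- inside the vendor's transaction: just note what changed.
CREATE TRIGGER dbo.trg_LabResults_capture
ON dbo.LabResults
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT ReportDB.dbo.Lab_Result_Diff (lab_id, action)
    SELECT lab_id, 'U' FROM inserted             -- inserts and updates
    UNION ALL
    SELECT lab_id, 'D' FROM deleted              -- true deletes only
    WHERE  lab_id NOT IN (SELECT lab_id FROM inserted);
END;
GO

-- A scheduled job (or a procedure launched by the pending-changes flag or
-- message) then reads the unprocessed diff rows, converts them into the
-- normalized reporting tables, and marks them processed - so the heavy
-- work stays out of the vendor's transaction.
```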
At second glance this looks pretty much like your original suggestion in the question. But the difference is: the catch-all tables hold data only temporarily; the amount of data is small - just what has changed; it doesn't have to be a single table; and eventually the data is stored in separate, properly normalized, permanent tables, which you fully control, which are independent of the third-party system, and which you can tune for your queries.
Context
StackExchange Database Administrators Q#110993, answer score: 8