snippetsqlMinor
How to design SQL schema that treats historical data as first-class citizen?
Viewed 0 times
treatsclasssqlcitizendesignfirstthathistoricalhowdata
Problem
I'm currently designing a SQL Server database schema for an application that consists of highly structured "tree" data. So, a table might look like this:
This is simple enough, but I also need to be able to update either the parentId of a node or the text of a node (or delete nodes), and access each revision of the tree at the same speed that the current revision can be retrieved. My current solution is to have two tables. First a revision table:
And then a modified node table:
So to get the tree for a specific revision, you take the revisionId and all revisionIds before it, then query the node table for all nodes with those revisionIds, then take only the highest revisionId nodes for nodes that share a UID, then delete any node that has the deleted flag set.
This works, but gets ugly fast (I'm using a simplified example) I actually have about 10 tables like this that need revision tracking, with each table having cross references to each other. This approach also breaks foreign keys, since UIDs are not unique. I cannot simply copy the full tree data for each revision, since there may be hundreds of revisions per hour.
What would the best practice be for a problem like this? I'd be willing to use a nonrelational database if it would be clearly superior.
id
parentId (references id in same table)
textThis is simple enough, but I also need to be able to update either the parentId of a node or the text of a node (or delete nodes), and access each revision of the tree at the same speed that the current revision can be retrieved. My current solution is to have two tables. First a revision table:
id
revisionDateAnd then a modified node table:
id
UID (same for all revisions of a specific node)
revisionId (references id of revision table)
deleted (flag for deleted node)
parentUID (references UID of parent node)
textSo to get the tree for a specific revision, you take the revisionId and all revisionIds before it, then query the node table for all nodes with those revisionIds, then take only the highest revisionId nodes for nodes that share a UID, then delete any node that has the deleted flag set.
This works, but gets ugly fast (I'm using a simplified example) I actually have about 10 tables like this that need revision tracking, with each table having cross references to each other. This approach also breaks foreign keys, since UIDs are not unique. I cannot simply copy the full tree data for each revision, since there may be hundreds of revisions per hour.
What would the best practice be for a problem like this? I'd be willing to use a nonrelational database if it would be clearly superior.
Solution
Trying to do FKs will just frustrate you - you don't actually need it.
If you handle versioning by using an InsertDateTime concept, then you're basically describing a Type2 dimension as used by many data warehouses. There is quite a lot of material out there about tuning systems that sit over the top of data warehouses, but from a T-SQL perspective, consider using APPLY. Like this:
This kind of construct benefits from an index on (IsDeleted, UID, InsertDateTime DESC).
Alternatively, if you want to look at the whole table as at a particular time, use ROW_NUMBER() like:
(edited from original to move the IsDeleted predicates outside the sub-queries, removing all versions at that particular time)
If you handle versioning by using an InsertDateTime concept, then you're basically describing a Type2 dimension as used by many data warehouses. There is quite a lot of material out there about tuning systems that sit over the top of data warehouses, but from a T-SQL perspective, consider using APPLY. Like this:
SELECT ...
FROM ... AS t
CROSS APPLY (
SELECT TOP (1) ...
FROM ... AS rev
WHERE rev.UID = t.UID
AND rev.InsertDateTime < @revdatetime
ORDER BY rev.InsertDateTime DESC) AS something
WHERE something.IsDeleted = 'N'This kind of construct benefits from an index on (IsDeleted, UID, InsertDateTime DESC).
Alternatively, if you want to look at the whole table as at a particular time, use ROW_NUMBER() like:
WITH numbered AS
( SELECT *, ROW_NUMBER() OVER (PARTITION BY UID ORDER BY InsertDateTime DESC) AS rownum
FROM ...
WHERE InsertDateTime < @revDateTime
),
VersionAtTime AS
( SELECT *
FROM numbered
WHERE rownum = 1
AND IsDeleted = 'N'
)
SELECT ...
FROM VersionAtTime
....(edited from original to move the IsDeleted predicates outside the sub-queries, removing all versions at that particular time)
Code Snippets
SELECT ...
FROM ... AS t
CROSS APPLY (
SELECT TOP (1) ...
FROM ... AS rev
WHERE rev.UID = t.UID
AND rev.InsertDateTime < @revdatetime
ORDER BY rev.InsertDateTime DESC) AS something
WHERE something.IsDeleted = 'N'WITH numbered AS
( SELECT *, ROW_NUMBER() OVER (PARTITION BY UID ORDER BY InsertDateTime DESC) AS rownum
FROM ...
WHERE InsertDateTime < @revDateTime
),
VersionAtTime AS
( SELECT *
FROM numbered
WHERE rownum = 1
AND IsDeleted = 'N'
)
SELECT ...
FROM VersionAtTime
....Context
StackExchange Database Administrators Q#114879, answer score: 3
Revisions (0)
No revisions yet.