HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Deciding between an IsDeleted/DateDeleted column or a separate History table?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
isdeletedcolumnhistoryseparatebetweendatedeleteddecidingtable

Problem

We have a table that stores nothing but an AccountId, TagId, and DateCreated. Users can update the Tags on accounts, which adds or remove records from this table. No tag can exist more than once on an account, but the same tag can be added and removed multiple times. This table is primarily used in queries users build to filter accounts using either JOIN or EXISTS on the tag table.

I've been asked to start recording tag history, and I'm trying to decide if it would be better to use an IsDeleted and/or DateDeleted column on the existing table, or if I should create a separate history table that stores this data instead.

The primary purpose of tracking the history would be to run reports on a specific point in time. For example, getting data on accounts that had TagA on them during January 2013.

What factors should I consider when trying to decide between the two?

Solution

The problem with IsDeleted is its low cadinality: 0 or 1. Which makes it unindexable and since all your queries will add IsDeleted = 0 to the WHERE (or JOIN) conditions this apparent trivial change complicates a lot of things. You covering indexes will all of the sudden require lookups into the base table to decide if the row qualifies or not. There are measures that can improve behavior, but they have to be properly considered and consistently deployed (note that some of these mitigation are mutually exclusive, evaluate them separately):

  • add IsDeleted as the leftmost key in the clustered index. This effectively 'splits' the clustered table into two regions ('deleted' and 'active'). It also forces de-facto the IsDeleted column in every non-clustered index, thus avoiding the need to lookup the value.



  • add IsDeleted as a leftmost key in every non-clustered index. May sound extreme, but can be quite effective, provided that all queries specify IsDeleted in the WHERE.



  • add IsDeleted as an included column to every non-clustered index. Unlike the options above this does not change seek/range/lookup semantics of the exiting indexes, but makes the IsDeleted check filtering available in every access path. Can be quite effective specially if IsDeleted = 1 occurs seldom.



  • make the non-clustered indexes filtered by IsDeleted = 0. May be effective on large tables, but is a bit dangerous as the filtered indexes become unusable for queries interested in the deleted rows.



A separate table for history makes a much more cleaner cut from the engine point of view: is a different table, no messy/risky plan choices to cope with. But it is much more impact on the application. If looking up the history of a row is the exception, the application changes are reasonable easy then I would consider this option first, as it has the least risk of (severe) performance regression.

I would strongly advise against mixed solutions (eg. tags and tags_history table and a view that unions them), such solutions have all the disadvantages mentioned above without any advantage. In general, anything that claims 'minimal impact' or 'transparent change' is snake oil. This (adding application layer row versioning/row history) is, by definition, a high impact change which will ripple big time in the application.

Context

StackExchange Database Administrators Q#34233, answer score: 9

Revisions (0)

No revisions yet.