
Most efficient queries on flags in TSQL: bit field or table join?

Submitted by: @import:stackexchange-dba
Tags: tsql, bitfield, table, efficient, join, flags, queries

Problem

Looking for the most efficient database structure for selecting rows from a large table based on one or more true/false values. For example, whether or not a user has opted in to marketing communications.

The use case here is finding user data for all users who have opted in. The Users table contains millions of rows of contact information.

Idea 1: Users table contains an indexed bit field

CREATE TABLE Users (
  UserID INT PRIMARY KEY,
  ...
  OptInFlag BIT NOT NULL
)


Updating a user's OptIn preference would then just involve setting the OptInFlag to 1 or 0.
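Under Idea 1, that preference change is a single-row update; a minimal sketch, where `@UserID` is a placeholder parameter:

```sql
-- Idea 1: opt a user in (set to 0 to opt out)
UPDATE Users
SET OptInFlag = 1
WHERE UserID = @UserID;
```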

Selecting data for all users who have opted in might benefit from the index, depending on opt-in distribution:

SELECT UserID, ... FROM Users WHERE OptInFlag = 1


Idea 2: Separate OptIn table that holds FK references to the UserIDs that have opted in

CREATE TABLE Users (
  UserID INT PRIMARY KEY
  ...
)

CREATE TABLE OptIn (
  OptInID INT IDENTITY PRIMARY KEY,
  UserID INT FOREIGN KEY REFERENCES Users (UserID)
)


Updating a user's OptIn preference would involve INSERTing into or DELETEing from the OptIn table.
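A sketch of those writes, assuming OptInID is auto-generated (e.g. an IDENTITY column) and `@UserID` is a placeholder parameter:

```sql
-- Opt in: add the user's row to the OptIn table
INSERT INTO OptIn (UserID) VALUES (@UserID);

-- Opt out: remove it
DELETE FROM OptIn WHERE UserID = @UserID;
```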
Selecting data for all users who have opted in might benefit from the pre-filtering of the JOIN:

SELECT u.UserID, ... FROM OptIn o JOIN Users u ON o.UserID = u.UserID


Idea 2 looks like it would be the most efficient at first blush, since it effectively pre-filters the Users table before the join. But is that actually the case, and are there any potential gotchas with this implementation? For instance, I expect that selecting opted-out users would be slower than in Idea 1, but that is acceptable in this scenario.

Is there an alternative better than either idea 1 or 2?

Edit: I ran some tests. The timing statistics show about the same total time for Idea 1 (using a filtered index) and Idea 2. However, the filtered index uses roughly a quarter of the CPU time of the join, making it the clear winner.
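The timing comparison can be reproduced with SQL Server's per-statement statistics; a sketch:

```sql
SET STATISTICS TIME ON;

-- Idea 1 (supported by a filtered index)
SELECT UserID FROM Users WHERE OptInFlag = 1;

-- Idea 2 (join against the OptIn table)
SELECT u.UserID FROM OptIn o JOIN Users u ON o.UserID = u.UserID;

SET STATISTICS TIME OFF;
-- CPU time and elapsed time for each statement are printed to the Messages output.
```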

Solution

Which method is most efficient depends heavily on the makeup of the data, how the data changes over time, and what queries are run.

Without a specific scenario to discuss, I would suggest a filtered index as the solution most likely to be efficient.

Edit:

You should add a filtered index, filtered on WHERE OptInFlag = 1, that includes the columns required by the query of consequence. SQL Server can then answer that query from the index alone, without referring to the main table at all. To verify that it reads only the index, examine the execution plan. The query of consequence presumably does not require every column of the user record, so the columns in the index would be a subset of the columns in the actual user record.
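A sketch of such a filtered, covering index; the included columns (Email, FirstName) are hypothetical stand-ins for whatever columns the query of consequence actually reads:

```sql
CREATE NONCLUSTERED INDEX IX_Users_OptedIn
ON Users (UserID)
INCLUDE (Email, FirstName)  -- hypothetical: the columns the query needs
WHERE OptInFlag = 1;        -- only opted-in rows are stored in the index
```

Because the index stores only opted-in rows plus the needed columns, the opted-in query scans a small, dense structure instead of the multi-million-row table.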

In short, the goal is to do a little extra work when inserting a record (updating the indexes) and to store some extra data, so that the query reads a significantly smaller amount of data laid out in contiguous blocks on disk (assuming traditional disk technology), rather than scanning the whole table and picking out the rows, and the fields within those rows, that the query requires.

Context

StackExchange Database Administrators Q#59056, answer score: 2
