HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to capture missing information in SQL Server tables

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
tableshowsqlcapturemissingserverinformation

Problem

I have a few tables with nullable columns in a SQL Server database. Those nullable attributes are either not supplied at the time of row creation or they are not applicable in some cases.

I realized that I cannot use the same NULL value to mean 'Not Supplied' in some cases and 'Not Applicable' in other cases. These optional columns are of type int, datetime, and string. Even if I mark string columns as mandatory (NOT NULL) and set the attributes to either 'Not Supplied' and 'Not applicable' as appropriate, what can I do for int and datetime columns? I read chapter 23 from the book:

Database Explorations: Essays on The Third Manifesto by C.J. Data and Hugh Darwen

Looks like the column Status only has numeric values but since the authors have set 'n/a' and 'd/k' (don't know), not sure if they created Status as a string column. Is it a good idea to create all the optional columns as string columns and capture the reasons in the attributes when they don't have a value?

Solution

Retain the correct data types in each column. Use what is referred to as a magic number to indicate "not applicable", or "unknown".

For instance, with an int column, you might use -2147483647 to indicate "not applicable", and -2147483646 to indicate "unknown", presuming those two values do not show up in the "normal" data. For dates, use values like N'1900-01-01T00:00:00' and N'1900-01-01T00:00:01'. Of course, you'll need to programmatically exclude these values when performing statistical analysis or reporting, since they would skew things like averages and sums.

An example table containing an int column using magic numbers:

CREATE TABLE dbo.Items
(
    SomeValue int NOT NULL
    , CONSTRAINT CK_SomeValue
        CHECK (
            SomeValue = -2147483647 /* Not Applicable */
            OR SomeValue = -2147483646 /* Unknown */
            OR (SomeValue > 0 AND SomeValue < 2147483648)
            )
);


Note, the check constraint that "documents" the magic/canary values. As Erik Darling details in a blog post, canary values remove the need for the column to allow NULLs, which in turn can alleviate some nasty surprises when using NOT IN (...) and JOIN ON ISNULL(...) = ISNULL(...), etc.

An example may be the best way to show the problems associated with using magic/canary values, and using NULL values instead. We'll do this work in tempdb, and create two almost identical tables. One uses magic/canary values, one uses NULL instead.

USE tempdb; 

IF OBJECT_ID(N'dbo.ItemsNoNulls', N'U') IS NOT NULL
DROP TABLE dbo.ItemsNoNulls;
IF OBJECT_ID(N'dbo.ItemsNulls', N'U') IS NOT NULL
DROP TABLE dbo.ItemsNulls;

CREATE TABLE dbo.ItemsNoNulls
(
    SomeValue int NOT NULL
    , CONSTRAINT CK_ItemsNoNulls_SomeValue
        CHECK (
            SomeValue = -2147483647 /* Unknown */
            OR SomeValue = -2147483646 /* Not Applicable */
            OR (SomeValue > 0 AND SomeValue  0 AND SomeValue < 2147483648)
            )
);


Here, we'll insert a load of rows where 2% of the rows are "magic" values of either -2147483647 or -2147483646:

;WITH src AS (
    SELECT val = ABS(CONVERT(int, CRYPT_GEN_RANDOM(4)))
    FROM sys.columns c1
        CROSS JOIN sys.columns c2
)
INSERT INTO dbo.ItemsNoNulls(SomeValue)
SELECT CASE src.val % 50 
    WHEN 1 THEN -2147483647
    WHEN 2 THEN -2147483646
    ELSE src.val
END
FROM src;


This will insert those previously generated rows into the table where NULL represents Unknown or Not Available:

INSERT INTO dbo.ItemsNulls (SomeValue)
SELECT CASE inn.SomeValue
        WHEN -2147483647 THEN NULL
        WHEN -2147483646 THEN NULL
        ELSE inn.SomeValue
    END
FROM dbo.ItemsNoNulls inn


Compare the output of these two simple queries:

SELECT [Avg] = AVG(CONVERT(bigint, SomeValue))
    , [Count] = COUNT(SomeValue)
FROM dbo.ItemsNoNulls;


╔═══════════╦════════╗
║ Avg ║ Count ║
╠═══════════╬════════╣
║ 944303347 ║ 549081 ║
╚═══════════╩════════╝

SELECT [Avg] = AVG(CONVERT(bigint, SomeValue))
    , [Count] = COUNT(SomeValue)
    , [CountNull] = COUNT(COALESCE(SomeValue, 0))
FROM dbo.ItemsNulls


╔════════════╦════════╦═══════════╗
║ Avg ║ Count ║ CountNull ║
╠════════════╬════════╬═══════════╣
║ 1073162998 ║ 527112 ║ 549081 ║
╚════════════╩════════╩═══════════╝

The "average" in the first resultset is artificially low because of the presence of the canary values. The "count" in the second query is lower than expected since aggregates eliminate NULL values, as shown by the warning:


Warning: Null value is eliminated by an aggregate or other SET operation.

If we modify the first query to eliminate the canary values:

SELECT [Avg] = AVG(CONVERT(bigint, SomeValue))
    , [Count] = COUNT(SomeValue)
FROM dbo.ItemsNoNulls
WHERE SomeValue <> -2147483647
    AND SomeValue <> -2147483646;


We get the correct average value, however the count no longer reflects the actual number of rows in the table:

╔════════════╦════════╗
║ Avg ║ Count ║
╠════════════╬════════╣
║ 1073162998 ║ 527112 ║
╚════════════╩════════╝

To get the accurate average, and an accurate count, we need to do something like:

SELECT [Avg] = AVG(CONVERT(bigint, SomeValue))
    , [Count] = (SELECT COUNT(SomeValue) FROM dbo.ItemsNoNulls)
FROM dbo.ItemsNoNulls
WHERE SomeValue <> -2147483647
    AND SomeValue <> -2147483646;


╔════════════╦════════╗
║ Avg ║ Count ║
╠════════════╬════════╣
║ 1073162998 ║ 549081 ║
╚════════════╩════════╝

As you can see, both NULL columns and magic/canary values introduce some unexpected problems that can be bad for reporting. Making sure you and other team members fully understand the implications is extremely important for accuracy.

One other method of recording "unknown" or "not applicable" values is to allow NULL values on those columns, combined with a separate column that indicates the status of the value. For example:

```
USE tempdb;
IF OBJECT_ID(N'dbo.Items', N'U'

Code Snippets

CREATE TABLE dbo.Items
(
    SomeValue int NOT NULL
    , CONSTRAINT CK_SomeValue
        CHECK (
            SomeValue = -2147483647 /* Not Applicable */
            OR SomeValue = -2147483646 /* Unknown */
            OR (SomeValue > 0 AND SomeValue < 2147483648)
            )
);
USE tempdb; 

IF OBJECT_ID(N'dbo.ItemsNoNulls', N'U') IS NOT NULL
DROP TABLE dbo.ItemsNoNulls;
IF OBJECT_ID(N'dbo.ItemsNulls', N'U') IS NOT NULL
DROP TABLE dbo.ItemsNulls;

CREATE TABLE dbo.ItemsNoNulls
(
    SomeValue int NOT NULL
    , CONSTRAINT CK_ItemsNoNulls_SomeValue
        CHECK (
            SomeValue = -2147483647 /* Unknown */
            OR SomeValue = -2147483646 /* Not Applicable */
            OR (SomeValue > 0 AND SomeValue < 2147483648)
            )
);

CREATE TABLE dbo.ItemsNulls
(
    SomeValue int NULL
    , CONSTRAINT CK_ItemsNulls_SomeValue
        CHECK (
            SomeValue IS NULL
            OR (SomeValue > 0 AND SomeValue < 2147483648)
            )
);
;WITH src AS (
    SELECT val = ABS(CONVERT(int, CRYPT_GEN_RANDOM(4)))
    FROM sys.columns c1
        CROSS JOIN sys.columns c2
)
INSERT INTO dbo.ItemsNoNulls(SomeValue)
SELECT CASE src.val % 50 
    WHEN 1 THEN -2147483647
    WHEN 2 THEN -2147483646
    ELSE src.val
END
FROM src;
INSERT INTO dbo.ItemsNulls (SomeValue)
SELECT CASE inn.SomeValue
        WHEN -2147483647 THEN NULL
        WHEN -2147483646 THEN NULL
        ELSE inn.SomeValue
    END
FROM dbo.ItemsNoNulls inn
SELECT [Avg] = AVG(CONVERT(bigint, SomeValue))
    , [Count] = COUNT(SomeValue)
FROM dbo.ItemsNoNulls;

Context

StackExchange Database Administrators Q#187435, answer score: 5

Revisions (0)

No revisions yet.