HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Quickest way to NULL or wipe a column

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
quickestcolumnnullwipeway

Problem

I have a unique situation, where I need to anonymize a column quickly. By that I mean any way to remove the data, whether it is NULL'ing, blanking or something else. There can be up to 20 million records to update and there are no indexes on the column I am updating.

I have tried a couple of things, for instance:

Update TABLE Set COLUMN = NULL


This clearly is the worst performing query. I have modified it to exclude any current NULLs or blanks from that set, but it is still very slow.

I have tried to drop and re-create the column, which so far is instant. But unfortunately because of business reasons, the column order is important, so this ruins that. (For the purpose of the question assume reordering columns is impossible).

Recently I tried to change the column type to char(1), then back to text - which had better performance. However, after seeing it run lightning fast with a drop and re-create column, I am curious if there is a way to do this with keeping the column order intact. Clearly SQL Server is capable of producing ~20 million records as NULLs instantly - there must be a clever way around this?

Solution

First let's do a quick review of how SQL Server generates the default ordering of columns when you do a SELECT * query. There are probably a few edge cases but I believe that columns are returned in the order that they are created. If a column is dropped the ordinal ID associated with that column is not used and cannot be used by a new column.

Consider this sample table:

CREATE TABLE dbo.SEE_COLUMN_ORDER (
COL1 INT,
COL2 INT,
COL3 INT,
COL4 INT
);


We can see the column_id using this query against sys.columns:

SELECT name, column_id
FROM sys.columns
WHERE object_id = OBJECT_ID('SEE_COLUMN_ORDER');


Initial results:

╔══════╦═══════════╗
║ name ║ column_id ║
╠══════╬═══════════╣
║ COL1 ║         1 ║
║ COL2 ║         2 ║
║ COL3 ║         3 ║
║ COL4 ║         4 ║
╚══════╩═══════════╝


Now drop a column:

ALTER TABLE dbo.SEE_COLUMN_ORDER DROP COLUMN COL3;


New results:

╔══════╦═══════════╗
║ name ║ column_id ║
╠══════╬═══════════╣
║ COL1 ║         1 ║
║ COL2 ║         2 ║
║ COL4 ║         4 ║
╚══════╩═══════════╝


Now add a column:

ALTER TABLE dbo.SEE_COLUMN_ORDER ADD COL3 INT;


New results:

╔══════╦═══════════╗
║ name ║ column_id ║
╠══════╬═══════════╣
║ COL1 ║         1 ║
║ COL2 ║         2 ║
║ COL4 ║         4 ║
║ COL3 ║         5 ║
╚══════╩═══════════╝


There is no column with a column_id of 3. As far as I can tell there is no known or supported way of making COL3 have a column_id of 3. If you wanted that column to be displayed third in a SELECT * query you could define a view against the table with the name and a different schema or you could drop and recreate the entire table with the column order that you want.

Recreating an entire table sounds like a slow operation but sometimes it can be quicker than updating all of the rows for a single column. It'll depend on your system, your table structure, and the data in your table. One scenario in which the drop and create might be faster is if your transaction log writes are the bottleneck. With a recovery model of simple you can create the new table with minimal logging which will write very little data to the transaction log compared to the UPDATE. Another scenario in which the drop and recreate can be faster is if the UPDATE will lead to a lot of page splits. It's possible to construct a table and an UPDATE which splits every data page into two (an UPDATE to a column value making it NULL won't do this as far as I know). As a general rule of thumb, the same amount of data is saved to the log regardless of the recovery model unless the operation can be minimally logged. UPDATE is never minimally logged so switching to simple won't reduce the transaction log requirements for that operation.

To cover the part of the question that deals with performance, one important thing to note is that adding and dropping a column to an existing table are optimized operations that have a fixed cost that doesn't scale with the amount of data in the table. To see this in action I'm going to look at how much data is logged in the transaction log for the operations (dropping and adding columns can be rolled back in a transaction). I'm testing against SQL Server 2016.

Here is the sample data:

DROP TABLE IF EXISTS dbo.X_COLUMN_WIPE_2;
CREATE TABLE dbo.X_COLUMN_WIPE_2 (
    ID INT NOT NULL IDENTITY (1, 1),
    COL_TO_WIPE VARCHAR(1) NULL,
    FILLER VARCHAR(100) NULL,
    PRIMARY KEY (ID)
);

-- 2536 rows
INSERT INTO dbo.X_COLUMN_WIPE_2 WITH (TABLOCK)
SELECT 'A', REPLICATE('Z', 100)
FROM master..spt_values t1;


Per sys.dm_tran_database_transactions, the UPDATE writes 324752 log bytes to the transaction log:

BEGIN TRANSACTION

UPDATE dbo.X_COLUMN_WIPE_2 SET COL_TO_WIPE = NULL; 

ROLLBACK;


Dropping and adding a column writes just 1992 log bytes to the log:

BEGIN TRANSACTION

ALTER TABLE dbo.X_COLUMN_WIPE_2 DROP COLUMN COL_TO_WIPE;

ALTER TABLE dbo.X_COLUMN_WIPE_2 ADD NEW_COLUMN VARCHAR(1) NULL;

ROLLBACK;


Now to test with more data:

DROP TABLE IF EXISTS dbo.X_COLUMN_WIPE_3;

CREATE TABLE dbo.X_COLUMN_WIPE_3 (
    ID INT NOT NULL IDENTITY (1, 1),
    COL_TO_WIPE VARCHAR(1) NULL,
    FILLER VARCHAR(100) NULL,
    PRIMARY KEY (ID)
);

-- 6431296 rows
INSERT INTO dbo.X_COLUMN_WIPE_3 WITH (TABLOCK)
SELECT 'A', REPLICATE('Z', 100)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;


The same UPDATE now writes 721979808 bytes to the transaction log but dropping and creating a column still writes just 1992 bytes.

It's possible to dig into the internals of why this happens using the undocumented DBCC PAGE. Here's an example from when I ran the code on my system (copying and pasting the code won't work because the page numbers will be different):

```
DROP TABLE IF EXISTS dbo.X_COLUMN_WIPE_4;
CREATE TABLE dbo.X_COLUMN_WIPE_4 (
ID INT NOT NULL IDENTITY (1, 1),
COL_TO_WIPE VARCHAR(1) NULL,
FILLER VARCHAR(100) NULL,
PRIMARY KEY (ID)
);

INSERT

Code Snippets

CREATE TABLE dbo.SEE_COLUMN_ORDER (
COL1 INT,
COL2 INT,
COL3 INT,
COL4 INT
);
SELECT name, column_id
FROM sys.columns
WHERE object_id = OBJECT_ID('SEE_COLUMN_ORDER');
╔══════╦═══════════╗
║ name ║ column_id ║
╠══════╬═══════════╣
║ COL1 ║         1 ║
║ COL2 ║         2 ║
║ COL3 ║         3 ║
║ COL4 ║         4 ║
╚══════╩═══════════╝
ALTER TABLE dbo.SEE_COLUMN_ORDER DROP COLUMN COL3;
╔══════╦═══════════╗
║ name ║ column_id ║
╠══════╬═══════════╣
║ COL1 ║         1 ║
║ COL2 ║         2 ║
║ COL4 ║         4 ║
╚══════╩═══════════╝

Context

StackExchange Database Administrators Q#173001, answer score: 13

Revisions (0)

No revisions yet.