UPDATE with JOIN on 100mm records, how to do this better? (in T-SQL)
Problem
I need to update 100 million records in a single table, in effect, normalizing the table by replacing the varchar value of a column with simply an ID. (I say "replacing" but really I'm writing the ID into another column.)
What I'm trying to achieve is to normalize the dataset. The not-yet-normalized data has no indexes. My thought was that I would not build indexes on the raw values, and would instead wait to index the foreign keys that will replace the varchar values with tinyint values after the update completes.
UPDATE A
SET A.AutoClassID = B.AutoClassID
FROM AutoDataImportStaging.dbo.Automobile AS A
JOIN AutoData.dbo.AutoClass AS B ON (A.AutoClassName = B.AutoClassName)

Background
- using MSSQL 2008 R2 on Server 2008 R2
- server has 8 GB RAM
- server has one RAID10, 7200 RPM SATA (not great, I know, in production this will only read data and not write data; plus recent HD shortage made this necessary for cost)
- server has dual quad-core Xeon CPU
- the machine is not doing anything else (currently dedicated to dev, only this process)
- simple recovery model turned on (? - but even in simple recovery, transactions are still fully logged so they can roll back, right?)
- note that the query references two different DBs, for what that's worth
- "width" of a record in table getting updated is 455 bytes
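To confirm which recovery model is actually in effect for the databases above, one option is to query the sys.databases catalog view (the database names are taken from the question's query; this check is not part of the original post):

```sql
-- Show the recovery model of the two databases involved
-- (works on SQL Server 2008 R2 and later).
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name IN ('AutoDataImportStaging', 'AutoData');
```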
Resources During Execution
- physical RAM is maxed out
- disk I/O is maxed out
- CPU is hardly doing anything (choke point is I/O)
- run time has been 14 hours and counting!
I suspect a few things: for example, that I need an index on the raw data, even though I will be dropping the column (AutoClassName) after the normalization updates. I also wonder whether I should have looped down the table one record at a time instead of using the JOIN, which seemed ridiculous when I started this, but now seems like it would have been faster.
How should I change my methodology so that my remaining normalization updates (similar to this one) run more quickly?
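The row-at-a-time loop contemplated above is usually implemented as a batched update rather than a literal one-row cursor. A minimal sketch of that pattern, reusing the question's tables (the batch size and the NULL check are illustrative assumptions, not from the original post):

```sql
-- Batched UPDATE: commit ~50,000 rows at a time so the log can truncate
-- between batches under simple recovery and locks stay short-lived.
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (50000) A
    SET A.AutoClassID = B.AutoClassID
    FROM AutoDataImportStaging.dbo.Automobile AS A
    JOIN AutoData.dbo.AutoClass AS B
        ON A.AutoClassName = B.AutoClassName
    WHERE A.AutoClassID IS NULL;  -- assumes AutoClassID starts out NULL

    SET @rows = @@ROWCOUNT;
END
```

Each iteration is its own implicit transaction, so a failure partway through loses only the current batch rather than 14 hours of work.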
Solution
I would take a different approach.
Instead of updating existing tables, just build a new table that has what you need in it.
This will almost certainly be faster:
SELECT DISTINCT
AutoClassID,
<Other fields>
INTO
AutoDataImportStaging.dbo.Automobile
FROM
AutoData.dbo.AutoClass

As currently written, there are a lot of logical operations happening:
- Read all values of A.AutoClassName
- Read all values of B.AutoClassName
- Compare A and B values
- Of the matching set, read all values of B.AutoClassID
- Update existing values of A.AutoClassID to the B.AutoClassID value through whatever indexes exist
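Since the new table ultimately needs the AutoClassID matched to each staged row, the build-a-new-table approach would in practice combine the SELECT ... INTO with the same join from the question. A hedged sketch under that reading (the target name NewAutomobile and the column placeholder are assumptions, not from the original answer):

```sql
-- Build a fresh table containing the tinyint ID instead of the varchar;
-- SELECT ... INTO can be minimally logged under the simple recovery model,
-- avoiding the per-row logging of a 100mm-record UPDATE.
SELECT
    B.AutoClassID,
    A.<Other fields>          -- remaining columns from the staging table
INTO
    AutoDataImportStaging.dbo.NewAutomobile
FROM
    AutoDataImportStaging.dbo.Automobile AS A
JOIN
    AutoData.dbo.AutoClass AS B
    ON A.AutoClassName = B.AutoClassName;
```

A common follow-up is to drop the original table, rename the new one, and only then build indexes on AutoClassID, which matches the question's plan of indexing after normalization completes.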
Context
StackExchange Database Administrators Q#20499, answer score: 10