patternsqlModerate
Quick way to validate two tables against each other
Viewed 0 times
validatetableseachwayagainsttwoquickother
Problem
We're doing an ETL process. When all is said and done there are a bunch of tables that should be identical. What is the quickest way to verify that those tables (on two different servers) are in fact identical. I'm talking both schema and data.
Can I do a hash on the table it's self like I would be able to on an individual file or filegroup - to compare one to the other. We have Red-Gate data compare but since the tables in question contain millions of rows each I'd like something a little more performant.
One approach that intrigues me is this creative use of the union statement. But, I'd like to explore the hash idea a little further if possible.
POST ANSWER UPDATE
For any future vistors... here is the exact approach I ended up taking. It worked so well we're doing it on every table in each database. Thanks to answers below for pointing me in the right direction.
```
CREATE PROCEDURE [dbo].[usp_DatabaseValidation]
@TableName varchar(50)
AS
BEGIN
SET NOCOUNT ON;
-- parameter = if no table name was passed do them all, otherwise just check the one
-- create a temp table that lists all tables in target database
CREATE TABLE #ChkSumTargetTables ([fullname] varchar(250), [name] varchar(50), chksum int);
INSERT INTO #ChkSumTargetTables ([fullname], [name], [chksum])
SELECT DISTINCT
'[MyDatabase].[' + S.name + '].['
+ T.name + ']' AS [fullname],
T.name AS [name],
0 AS [chksum]
FROM MyDatabase.sys.tables T
INNER JOIN MyDatabase.sys.schemas S ON T.schema_id = S.schema_id
WHERE
T.name like IsNull(@TableName,'%');
-- create a temp table that lists all tables in source database
CREATE TABLE #ChkSumSourceTables ([fullname] varchar(250), [name] varchar(50), chksum int)
INSERT INTO #ChkSumSourceTables ([fullname], [name], [chksum])
SELECT DISTINCT
'[MyLinkedServer].[MyDatabase].[' + S.name + '].['
Can I do a hash on the table it's self like I would be able to on an individual file or filegroup - to compare one to the other. We have Red-Gate data compare but since the tables in question contain millions of rows each I'd like something a little more performant.
One approach that intrigues me is this creative use of the union statement. But, I'd like to explore the hash idea a little further if possible.
POST ANSWER UPDATE
For any future vistors... here is the exact approach I ended up taking. It worked so well we're doing it on every table in each database. Thanks to answers below for pointing me in the right direction.
```
CREATE PROCEDURE [dbo].[usp_DatabaseValidation]
@TableName varchar(50)
AS
BEGIN
SET NOCOUNT ON;
-- parameter = if no table name was passed do them all, otherwise just check the one
-- create a temp table that lists all tables in target database
CREATE TABLE #ChkSumTargetTables ([fullname] varchar(250), [name] varchar(50), chksum int);
INSERT INTO #ChkSumTargetTables ([fullname], [name], [chksum])
SELECT DISTINCT
'[MyDatabase].[' + S.name + '].['
+ T.name + ']' AS [fullname],
T.name AS [name],
0 AS [chksum]
FROM MyDatabase.sys.tables T
INNER JOIN MyDatabase.sys.schemas S ON T.schema_id = S.schema_id
WHERE
T.name like IsNull(@TableName,'%');
-- create a temp table that lists all tables in source database
CREATE TABLE #ChkSumSourceTables ([fullname] varchar(250), [name] varchar(50), chksum int)
INSERT INTO #ChkSumSourceTables ([fullname], [name], [chksum])
SELECT DISTINCT
'[MyLinkedServer].[MyDatabase].[' + S.name + '].['
Solution
Here's what I've done before:
It's worked well enough on tables that are about 1,000,000 rows, but I'm not sure how well that would work on extremely large tables.
Added:
I've run the query against my system which compares two tables with 21 fields of regular types in two different databases attached to the same server running SQL Server 2005. The table has about 3 million rows, and there's about 25000 rows different. The primary key on the table is weird, however, as it's a composite key of 10 fields (it's an audit table).
The execution plans for the queries has a total cost of 184.25879 for
Actually executing either query takes about 42s plus about 3s to actually transmit the rows. The time between the two queries is identical.
Second Addition:
This is actually extremely fast, each one running against 3 million rows in about 2.5s:
If the results of those don't match, you know the tables are different. However, if the results do match, you're not guaranteed that the tables are identical because of the [highly unlikely] chance of checksum collisions.
I'm not sure how datatype changes between tables would affect this calculation. I would run the query against the
I tried the query against another table with 5 million rows and that one ran in about 5s, so it appears to be largely O(n).
(SELECT 'TableA', * FROM TableA
EXCEPT
SELECT 'TableA', * FROM TableB)
UNION ALL
(SELECT 'TableB', * FROM TableB
EXCEPT
SELECT 'TableB', * FROM TableA)It's worked well enough on tables that are about 1,000,000 rows, but I'm not sure how well that would work on extremely large tables.
Added:
I've run the query against my system which compares two tables with 21 fields of regular types in two different databases attached to the same server running SQL Server 2005. The table has about 3 million rows, and there's about 25000 rows different. The primary key on the table is weird, however, as it's a composite key of 10 fields (it's an audit table).
The execution plans for the queries has a total cost of 184.25879 for
UNION and 184.22983 for UNION ALL. The tree cost only differs on the last step before returning rows, the concatenation.Actually executing either query takes about 42s plus about 3s to actually transmit the rows. The time between the two queries is identical.
Second Addition:
This is actually extremely fast, each one running against 3 million rows in about 2.5s:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM TableA
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM TableBIf the results of those don't match, you know the tables are different. However, if the results do match, you're not guaranteed that the tables are identical because of the [highly unlikely] chance of checksum collisions.
I'm not sure how datatype changes between tables would affect this calculation. I would run the query against the
system views or information_schema views.I tried the query against another table with 5 million rows and that one ran in about 5s, so it appears to be largely O(n).
Code Snippets
(SELECT 'TableA', * FROM TableA
EXCEPT
SELECT 'TableA', * FROM TableB)
UNION ALL
(SELECT 'TableB', * FROM TableB
EXCEPT
SELECT 'TableB', * FROM TableA)SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM TableA
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM TableBContext
StackExchange Database Administrators Q#34356, answer score: 19
Revisions (0)
No revisions yet.