patternsqlMinor
Find changed columns for near-identical rows across tables
Viewed 0 times
rowsacrosstablescolumnsidenticalforfindchangednear
Problem
I run PostgreSQL 9.3.5 on Windows 7, 64-bit.
My data arrives quarterly, in multiple tables (
Each period, in each table, some new
This selects all rows which are either new (new
About 96% of each table remains unchanged in every respect. Accordingly, in analysing the cross-period changes I build a table that only includes changed and new data. This reduces the table size from ~3.5m rows to ~225k rows: that's a BIG reduction if you subsequently do spatial comparisons with relatively-complex polygons and multiple (spatial and non-spatial)
The
``
My data arrives quarterly, in multiple tables (
table1, ..., tableN) that are linked, intra-period, by cross-table constraints based on key identifiers. Among other columns, each table has identifiers that persist over time: pfi - persistent feature identifier and ufi - universal feature identifier.pfi is unique per table (it's exceedingly rare that table1.pfi = table2.pfi.ufi is unique across all tables and across all time. It's not a hash of the row data, but you could think of it as such.Each period, in each table, some new
pfiare brought into being and some old pfi are retired. Some pfi change attributes. ufi tracks any change to any attribute for a given pfi(row), so to fetch changed (and new) rows for table1 it's simply a matter of:-- 1st query
select a.*
into vm201512.property_d
from vm201512.property a
where not exists (select 1 from vm201412.property where ufi = a.ufi);This selects all rows which are either new (new
pfi) or changed in at least one column.About 96% of each table remains unchanged in every respect. Accordingly, in analysing the cross-period changes I build a table that only includes changed and new data. This reduces the table size from ~3.5m rows to ~225k rows: that's a BIG reduction if you subsequently do spatial comparisons with relatively-complex polygons and multiple (spatial and non-spatial)
JOINs.The
property table has relatively few columns, so I can identify which elements of the data have changes as follows:``
-- 2nd query
create table vm201512.property_d_changes as
select pfi,
case when a.view_pfi=b.view_pfi then 0::int else 1::INT end as view_pfi,
case when a.status=b.status then 0::int else 1::INT end as status,
case when a.property_type=b.property_type then 0::int else 1::INT end as property_type,
-- ... more columns
from vm201512.property_d a -- table created with first query
join vm201412.property b using (pfi);
`Solution
Clarifications
Your comment needs addressing first:
numeric data almost always takes 0 (and text types take '')
The key word here is "almost". As long as it is not "never" (as in "never ever!"), you need to take NULL into account anyway.
no risk of testing
No it wouldn't. Anything compared to NULL is always NULL even
I think I just need to change
If you want to count every case of
You use
Postgres provides pivot functionality in the tablefunc module, but nothing is pivoted here.
The core problem is the dynamic nature of the query due to varying input tables.
Solution
Assuming no NULL values. Where NULL values are possible, use
Tested in Postgres 9.5. Should work for Postgres 9.1 or later.
You can build your queries like this:
Call:
Returns a query like this (which you can execute in turn):
Result:
Works for arbitrary input tables, and deals with identifiers safely. You can optionally schema-qualify passed table names. See:
Simple dynamic solution
The difficulty is to return varying row types. SQL demands to know the return type at call time. To avoid difficulties, you could return a simple array instead. You get values in the original order of columns, but you don't get column names like in the first query:
Call (note the difference!):
Result:
db<>fiddle here
Old sqlfiddle
You could even make the same function return a dynamic result set for various tables, but I doubt it's worth the complication:
If your really need dynamic pivot functionality (not in this case):
Your comment needs addressing first:
numeric data almost always takes 0 (and text types take '')
The key word here is "almost". As long as it is not "never" (as in "never ever!"), you need to take NULL into account anyway.
no risk of testing
NULL=NULL, which would return 1 inappropriatelyNo it wouldn't. Anything compared to NULL is always NULL even
NULL=NULL. Try it. You need to understand NULL comparison.I think I just need to change
sum(col1) to sum(col1::int) to get the number of rows where col1 changed.If you want to count every case of
a.col1 IS DISTINCT FROM b.col1, then you need to work with NULL-safe comparison to begin with. Apart from that, your expression would work. There are many alternatives, depending on the situation:- For absolute performance, is SUM faster or COUNT?
You use
select a.* into vm201512 ... in your 1st query. Don't. SELECT INTO ... is discouraged. Use the superior CREATE TABLE AS ... like in your 2nd query. See- Creating temporary tables in SQL
Postgres provides pivot functionality in the tablefunc module, but nothing is pivoted here.
The core problem is the dynamic nature of the query due to varying input tables.
Solution
Assuming no NULL values. Where NULL values are possible, use
IS NOT DISTINCT FROM instead of =.Tested in Postgres 9.5. Should work for Postgres 9.1 or later.
You can build your queries like this:
CREATE OR REPLACE FUNCTION f_build_query(_t1 regclass
, _t2 regclass
, _join_col text = 'pfi')
RETURNS text
LANGUAGE sql AS
$func$
SELECT format('SELECT %I, %s FROM %s a JOIN %s b USING (%1$I);'
, _join_col
, string_agg(format('a.%1$I = b.%1$I AS %1$I', attname), ', ' ORDER BY attnum)
, _t1, _t2)
FROM pg_attribute
WHERE attrelid = _t1 -- compare all columns from 1st table
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
AND attname <> _join_col -- exclude 'pfi'
$func$;Call:
SELECT f_build_query('vm201512.property_d', 'vm201412.property');Returns a query like this (which you can execute in turn):
SELECT pfi, a.a = b.a AS a, a."weird NaMe" = b."weird NaMe" AS "weird NaMe" -- more ...
FROM vm201512.property_d a JOIN vm201412.property b USING (pfi);Result:
pfi | a | b | weird NaMe
-----+---+---+------------
1 | t | f | t
2 | f | t | f
Works for arbitrary input tables, and deals with identifiers safely. You can optionally schema-qualify passed table names. See:
- Table name as a PostgreSQL function parameter
Simple dynamic solution
The difficulty is to return varying row types. SQL demands to know the return type at call time. To avoid difficulties, you could return a simple array instead. You get values in the original order of columns, but you don't get column names like in the first query:
CREATE OR REPLACE FUNCTION f_diff_matrix(_t1 regclass
, _t2 regclass
, _join_col text = 'pfi')
RETURNS TABLE (pfi int, change_matrix bool[]) -- adapt type of pfi as needed
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE (
SELECT format('SELECT %I, ARRAY[%s] FROM %s a JOIN %s b USING (%1$I)'
, _join_col
, string_agg(format('a.%1$I = b.%1$I', attname), ', ' ORDER BY attnum)
, _t1, _t2)
FROM pg_attribute
WHERE attrelid = _t1 -- compare all columns from 1st table
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
AND attname <> _join_col -- exclude 'pfi'
);
END
$func$;Call (note the difference!):
SELECT * FROM f_diff_matrix('vm201512.property_d', 'vm201412.property');Result:
pfi | change_matrix
-----+---------------
1 | {t,f,t} -- one element per column
2 | {f,t,f}
db<>fiddle here
Old sqlfiddle
You could even make the same function return a dynamic result set for various tables, but I doubt it's worth the complication:
- Refactor a PL/pgSQL function to return the output of various SELECT queries
If your really need dynamic pivot functionality (not in this case):
- Dynamic alternative to pivot with CASE and GROUP BY
Code Snippets
CREATE OR REPLACE FUNCTION f_build_query(_t1 regclass
, _t2 regclass
, _join_col text = 'pfi')
RETURNS text
LANGUAGE sql AS
$func$
SELECT format('SELECT %I, %s FROM %s a JOIN %s b USING (%1$I);'
, _join_col
, string_agg(format('a.%1$I = b.%1$I AS %1$I', attname), ', ' ORDER BY attnum)
, _t1, _t2)
FROM pg_attribute
WHERE attrelid = _t1 -- compare all columns from 1st table
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
AND attname <> _join_col -- exclude 'pfi'
$func$;SELECT f_build_query('vm201512.property_d', 'vm201412.property');SELECT pfi, a.a = b.a AS a, a."weird NaMe" = b."weird NaMe" AS "weird NaMe" -- more ...
FROM vm201512.property_d a JOIN vm201412.property b USING (pfi);CREATE OR REPLACE FUNCTION f_diff_matrix(_t1 regclass
, _t2 regclass
, _join_col text = 'pfi')
RETURNS TABLE (pfi int, change_matrix bool[]) -- adapt type of pfi as needed
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE (
SELECT format('SELECT %I, ARRAY[%s] FROM %s a JOIN %s b USING (%1$I)'
, _join_col
, string_agg(format('a.%1$I = b.%1$I', attname), ', ' ORDER BY attnum)
, _t1, _t2)
FROM pg_attribute
WHERE attrelid = _t1 -- compare all columns from 1st table
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
AND attname <> _join_col -- exclude 'pfi'
);
END
$func$;SELECT * FROM f_diff_matrix('vm201512.property_d', 'vm201412.property');Context
StackExchange Database Administrators Q#129265, answer score: 5
Revisions (0)
No revisions yet.