patternsqlMinor
Find rows in a single table where values in a certain column are near-duplicates
Viewed 0 times
rowscolumnarewhereduplicatessinglenearfindvaluescertain
Problem
I have a Postgres 11 table on RDS containing a column
To be clear, none of the present rows are true duplicates, nor share precisely the same values in that column. My objective is to identify these records [and, once found, eliminate/merge the de-facto duplicates].
My initial inclination was to use a self-join, e.g.:
However, this returns all the email records rather than only those that are de-facto-duplications.
Using a subquery such as the following produces a similar [i.e., too-inclusive] result:
Since I’m not looking for an aggregate, but rather a case-insensitive comparison, it seems to me that a window function is not the correct approach.
I think that this should be a straightforward query, but I’m having a difficult time seeing it clearly and am sure there is an error in the way I’m conceiving of the problem; it’s pretty frustrating.
In addition to searching here and on SO, I consulted Molinaro’s SQL Cookbook, but to no avail.
What is the correct way to structure the query so that it returns only those records whose
edit note : my initial question formulation expressed a misguided inclination to use ILIKE for case-insensitive matching, but the use of lower() as suggested in the below answers is far more sensible
email; some of the values in this column [and only that column] are clearly de facto duplicates but differ in case, i.e., different capitalization, such as:foo@****.com
Foo@****.comTo be clear, none of the present rows are true duplicates, nor share precisely the same values in that column. My objective is to identify these records [and, once found, eliminate/merge the de-facto duplicates].
My initial inclination was to use a self-join, e.g.:
SELECT c.email
FROM schema.table c
INNER JOIN schema.table d ON lower(c.email) = lower(d.email)
ORDER BY c.email;However, this returns all the email records rather than only those that are de-facto-duplications.
Using a subquery such as the following produces a similar [i.e., too-inclusive] result:
SELECT c.email, alias.email
FROM schema.table c
JOIN (SELECT email FROM schema.table) alias ON lower(c.email) = lower(alias.email);Since I’m not looking for an aggregate, but rather a case-insensitive comparison, it seems to me that a window function is not the correct approach.
I think that this should be a straightforward query, but I’m having a difficult time seeing it clearly and am sure there is an error in the way I’m conceiving of the problem; it’s pretty frustrating.
In addition to searching here and on SO, I consulted Molinaro’s SQL Cookbook, but to no avail.
What is the correct way to structure the query so that it returns only those records whose
email values are the same, disregarding case? edit note : my initial question formulation expressed a misguided inclination to use ILIKE for case-insensitive matching, but the use of lower() as suggested in the below answers is far more sensible
Solution
Another case for
If you have a PK, use it instead of
This returns every qualifying row once. The added
Why not use a simple join?
If you have, say, 10 variants of the same email, a simple join would give you 10 over 2 = 90 rows, and repeat every combination with reversed roles. Basically a limited Carthesian product of all dupes for each set of dupes.
Related:
A trigram index as suggested there should greatly help performance with tables of non-trivial size.
Also note that
EXISTS:SELECT *
FROM schema.table t
WHERE EXISTS (
SELECT FROM schema.table t1
WHERE lower(t.email) = lower(t1.email)
AND t.ctid <> t1.ctid
)
ORDER BY lower(email), email;If you have a PK, use it instead of
ctid. Related:- How do I remove duplicate records in a join table in PostgreSQL?
This returns every qualifying row once. The added
ORDER BY helps to keep dupes together and in deterministic sort order (unless your locale is case insensitive).Why not use a simple join?
If you have, say, 10 variants of the same email, a simple join would give you 10 over 2 = 90 rows, and repeat every combination with reversed roles. Basically a limited Carthesian product of all dupes for each set of dupes.
Related:
- LOWER LIKE vs iLIKE
A trigram index as suggested there should greatly help performance with tables of non-trivial size.
Also note that
lower(t.email) = lower(t1.email) is slightly different from t.email ILIKE t1.email. The latter treats the right side as pattern, where some characters have special meaning unless you escape them. See:- Escape function for regular expression or LIKE patterns
Code Snippets
SELECT *
FROM schema.table t
WHERE EXISTS (
SELECT FROM schema.table t1
WHERE lower(t.email) = lower(t1.email)
AND t.ctid <> t1.ctid
)
ORDER BY lower(email), email;Context
StackExchange Database Administrators Q#254722, answer score: 3
Revisions (0)
No revisions yet.