patternsqlMinor
Remove extra email characters to enforce uniqueness
Viewed 0 times
enforceemailremovecharactersuniquenessextra
Problem
- Gmail ignores
.inside an email address e.gabc@gmail.comanda.b.c@gmail.comare same
- Gmail also considers same account if account has any number after
+symbol. e.gxyz@gmailandxyz+1@gmail.comare same
I want to put constraint for uniqueness on emails which can handle above case in postgres but not successful so far. Simple unique and saving in lower case doesn't work
Solution
To answer your question (+1 for an interesting, challenging and relevant question). I did the following:
Created a table:
Populated it with sample data (as per the question, the only place a
And then ran this query:
this gives us the domain from after the
I then use the SQL string concatenation operator (
So, my data now looks like this (
%%CODEBLOCK_7%%
So, the only dots and at signs remaining in the email string are valid, but we still have to get rid of the
So, the query now looks like this (
%%CODEBLOCK_8%%
This replaces all (
In fact, this part might be redundant. If there are no symbols other than
The LOWER() just tidies it up - I prefer emails in lower case!
Et voilà!), '[^a-zA-Z0-9@.]', '', 'g' ) ) AS "New email" FROM regexp_test;
Explanation of query below.
Result:
%%CODEBLOCK_3%%
All this is available on a fiddle here.
TL;DR - this will work for an email which contains multiple full stops (dots or periods) either in the first (local) part of the email (these will be removed as per the question)
In particular, emails like
To explain the query, it is best to go from the inside out (you can see some of the logical progression on the fiddle itself).
The first inner-most part uses SPLIT_PART() twice:
%%CODEBLOCK_4%%
Which gets (chomps) the email string up to the first
Then the outer
%%CODEBLOCK_5%%
removes (replaces
Now, I get the email domain with the SUBSTRING() function using a regular expression as explained here:
%%CODEBLOCK_6%%
this gives us the domain from after the
I then use the SQL string concatenation operator (
So, my data now looks like this (
%%CODEBLOCK_7%%
So, the only dots and at signs remaining in the email string are valid, but we still have to get rid of the
So, the query now looks like this (
%%CODEBLOCK_8%%
This replaces all (
In fact, this part might be redundant. If there are no symbols other than
The LOWER() just tidies it up - I prefer emails in lower case!
Et voilà!
Created a table:
CREATE TABLE regexp_test (my_email_string VARCHAR(255));Populated it with sample data (as per the question, the only place a
+sign can appear in the email address is at the end of the first (local) part with 1 or more digits between it and the @ sign:INSERT INTO regexp_test
VALUES
('as.dfWEWd.fs+4@stuff.com'),
('aS.Sdf.34343a.sfs@yahoo.com'),
('adsFFdsf"£"££$£$+15@tester.cn'), <-- non-standard characters!
('test_dots.asdf.+12345@blah.au.com')
-- multiple dots after the @ sign
-- this is the tricky part!And then ran this query:
SELECT
LOWER
(
REGEXP_REPLACE
(
REPLACE
(
SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1)
, '.', ''
)
||
'@'
||
SUBSTRING(my_email_string, '[^@]*
Explanation of query below.
Result:
New Email
asdfwewdfs@stuff.com
assdf34343asfs@yahoo.com
adsffdsf@tester.cn
testdotsasdf@blah.au.com
All this is available on a fiddle here.
TL;DR - this will work for an email which contains multiple full stops (dots or periods) either in the first (local) part of the email (these will be removed as per the question) OR in the second (domain) part - this part can legitimately have multiple dots - say blah.au.com. Any symbols which are not alphanumeric will be removed from the local part. Anything between a + and the @ sign in the local part will also be removed.
In particular, emails like john.m.doe+15@blah.au.com will be converted to johnmdoe@blah.au.com.
To explain the query, it is best to go from the inside out (you can see some of the logical progression on the fiddle itself).
The first inner-most part uses SPLIT_PART() twice:
SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1)
Which gets (chomps) the email string up to the first @ sign (and only @ signe in a valid email address). Then the second SPLIT_STRING chomps the string still further and removes anything from a + sign to the @ symbol inclusive.
Then the outer
REPLACE
(
SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1)
, '.', ''
)
removes (replaces '.' with '' - nothing) all the dots in the first (local) part of the string as per the question, but not in the second (domain) part (addresses such as @blah.com.au are valid).
Now, I get the email domain with the SUBSTRING() function using a regular expression as explained here:
SUBSTRING(my_email_string, '[^@]*
this gives us the domain from after the @ sign to the end of the string.
I then use the SQL string concatenation operator (|| - double pipe) to join the two strings and also reinsert the removed @ sign.
So, my data now looks like this (result_1):
asdfWEWdfs@stuff.com
aSSdf34343asfs@yahoo.com
adsFFdsf"£"££$£$@tester.cn
test_dotsasdf@blah.au.com
So, the only dots and at signs remaining in the email string are valid, but we still have to get rid of the + signs (from the question) and other symbols (for checking) from the string. For this we use REGEXP_REPLACE.
So, the query now looks like this (result_1 substituted in for clarity).
REGEXP_REPLACE(result_1, '[^a-zA-Z0-9@.]', '', 'g')
This replaces all ('g' for global) characters not (^) in a-z or A-Z or the @ symbol or a dot with nothing ('').
In fact, this part might be redundant. If there are no symbols other than .s (dots) anywhere in the local part of the string, or a + sign followed by digits and then an @ sign, then the REGEXP_REPLACEis unnecessary! I threw it in just in case!
The LOWER() just tidies it up - I prefer emails in lower case!
Et voilà!),
'[^a-zA-Z0-9@.]', '', 'g'
)
) AS "New email"
FROM regexp_test;
Explanation of query below.
Result:
%%CODEBLOCK_3%%
All this is available on a fiddle here.
TL;DR - this will work for an email which contains multiple full stops (dots or periods) either in the first (local) part of the email (these will be removed as per the question) OR in the second (domain) part - this part can legitimately have multiple dots - say blah.au.com. Any symbols which are not alphanumeric will be removed from the local part. Anything between a + and the @ sign in the local part will also be removed.
In particular, emails like john.m.doe+15@blah.au.com will be converted to johnmdoe@blah.au.com.
To explain the query, it is best to go from the inside out (you can see some of the logical progression on the fiddle itself).
The first inner-most part uses SPLIT_PART() twice:
%%CODEBLOCK_4%%
Which gets (chomps) the email string up to the first @ sign (and only @ signe in a valid email address). Then the second SPLIT_STRING chomps the string still further and removes anything from a + sign to the @ symbol inclusive.
Then the outer
%%CODEBLOCK_5%%
removes (replaces '.' with '' - nothing) all the dots in the first (local) part of the string as per the question, but not in the second (domain) part (addresses such as @blah.com.au are valid).
Now, I get the email domain with the SUBSTRING() function using a regular expression as explained here:
%%CODEBLOCK_6%%
this gives us the domain from after the @ sign to the end of the string.
I then use the SQL string concatenation operator (|| - double pipe) to join the two strings and also reinsert the removed @ sign.
So, my data now looks like this (result_1):
%%CODEBLOCK_7%%
So, the only dots and at signs remaining in the email string are valid, but we still have to get rid of the + signs (from the question) and other symbols (for checking) from the string. For this we use REGEXP_REPLACE.
So, the query now looks like this (result_1 substituted in for clarity).
%%CODEBLOCK_8%%
This replaces all ('g' for global) characters not (^) in a-z or A-Z or the @ symbol or a dot with nothing ('').
In fact, this part might be redundant. If there are no symbols other than .s (dots) anywhere in the local part of the string, or a + sign followed by digits and then an @ sign, then the REGEXP_REPLACEis unnecessary! I threw it in just in case!
The LOWER() just tidies it up - I prefer emails in lower case!
Et voilà!),this gives us the domain from after the
@ sign to the end of the string.I then use the SQL string concatenation operator (
|| - double pipe) to join the two strings and also reinsert the removed @ sign.So, my data now looks like this (
result_1):%%CODEBLOCK_7%%
So, the only dots and at signs remaining in the email string are valid, but we still have to get rid of the
+ signs (from the question) and other symbols (for checking) from the string. For this we use REGEXP_REPLACE.So, the query now looks like this (
result_1 substituted in for clarity).%%CODEBLOCK_8%%
This replaces all (
'g' for global) characters not (^) in a-z or A-Z or the @ symbol or a dot with nothing (''). In fact, this part might be redundant. If there are no symbols other than
.s (dots) anywhere in the local part of the string, or a + sign followed by digits and then an @ sign, then the REGEXP_REPLACEis unnecessary! I threw it in just in case!The LOWER() just tidies it up - I prefer emails in lower case!
Et voilà!), '[^a-zA-Z0-9@.]', '', 'g' ) ) AS "New email" FROM regexp_test;
Explanation of query below.
Result:
%%CODEBLOCK_3%%
All this is available on a fiddle here.
TL;DR - this will work for an email which contains multiple full stops (dots or periods) either in the first (local) part of the email (these will be removed as per the question)
OR in the second (domain) part - this part can legitimately have multiple dots - say blah.au.com. Any symbols which are not alphanumeric will be removed from the local part. Anything between a + and the @ sign in the local part will also be removed.In particular, emails like
john.m.doe+15@blah.au.com will be converted to johnmdoe@blah.au.com.To explain the query, it is best to go from the inside out (you can see some of the logical progression on the fiddle itself).
The first inner-most part uses SPLIT_PART() twice:
%%CODEBLOCK_4%%
Which gets (chomps) the email string up to the first
@ sign (and only @ signe in a valid email address). Then the second SPLIT_STRING chomps the string still further and removes anything from a + sign to the @ symbol inclusive.Then the outer
%%CODEBLOCK_5%%
removes (replaces
'.' with '' - nothing) all the dots in the first (local) part of the string as per the question, but not in the second (domain) part (addresses such as @blah.com.au are valid).Now, I get the email domain with the SUBSTRING() function using a regular expression as explained here:
%%CODEBLOCK_6%%
this gives us the domain from after the
@ sign to the end of the string.I then use the SQL string concatenation operator (
|| - double pipe) to join the two strings and also reinsert the removed @ sign.So, my data now looks like this (
result_1):%%CODEBLOCK_7%%
So, the only dots and at signs remaining in the email string are valid, but we still have to get rid of the
+ signs (from the question) and other symbols (for checking) from the string. For this we use REGEXP_REPLACE.So, the query now looks like this (
result_1 substituted in for clarity).%%CODEBLOCK_8%%
This replaces all (
'g' for global) characters not (^) in a-z or A-Z or the @ symbol or a dot with nothing (''). In fact, this part might be redundant. If there are no symbols other than
.s (dots) anywhere in the local part of the string, or a + sign followed by digits and then an @ sign, then the REGEXP_REPLACEis unnecessary! I threw it in just in case!The LOWER() just tidies it up - I prefer emails in lower case!
Et voilà!
Code Snippets
CREATE TABLE regexp_test (my_email_string VARCHAR(255));INSERT INTO regexp_test
VALUES
('as.dfWEWd.fs+4@stuff.com'),
('aS.Sdf.34343a.sfs@yahoo.com'),
('adsFFdsf"£"££$£$+15@tester.cn'), <-- non-standard characters!
('test_dots.asdf.+12345@blah.au.com')
-- multiple dots after the @ sign
-- this is the tricky part!SELECT
LOWER
(
REGEXP_REPLACE
(
REPLACE
(
SPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1)
, '.', ''
)
||
'@'
||
SUBSTRING(my_email_string, '[^@]*$'),
'[^a-zA-Z0-9@.]', '', 'g'
)
) AS "New email"
FROM regexp_test;New Email
asdfwewdfs@stuff.com
assdf34343asfs@yahoo.com
adsffdsf@tester.cn
testdotsasdf@blah.au.comSPLIT_PART(SPLIT_PART(my_email_string, '@', 1), '+', 1)Context
StackExchange Database Administrators Q#208558, answer score: 4
Revisions (0)
No revisions yet.