How to prevent duplicate VARCHAR without a key limit?
Problem
I'd like to store URLs in a database column and enforce a constraint that values must be unique. Unfortunately, MySQL limits the length of index keys, which means that only the first X characters of the URL get checked for uniqueness. As a result, I've run into false positives where two different URLs triggered an integrity constraint violation because their first X characters happened to be identical.
Is there a way to enforce uniqueness on a VARCHAR column without any limit on its length?
Is it possible to, say, create a non-UNIQUE index over the first X characters and then have a trigger block INSERTs if the remaining characters are identical?
Solution
We keep giving you answers that do not directly answer the question, because that is how we solve this problem. An index of unlimited length is impractical and inefficient, but a unique hash provides a solution that is sufficient to the task because of the astronomically low likelihood of a meaningful collision.
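To put a rough number on "astronomically low": treating MD5 as a uniform 128-bit function, the usual birthday approximation for n distinct URLs gives the estimate below. The figure of one billion URLs is an illustrative assumption, and the estimate covers accidental collisions only; MD5 is known to be weak against deliberately constructed ones.

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{128}} \approx \frac{n^{2}}{2^{129}},
\qquad
n = 10^{9} \;\Rightarrow\; P \approx \frac{10^{18}}{6.8 \times 10^{38}} \approx 1.5 \times 10^{-21}
```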
Similar to the other offered solutions, my standard approach does not check for duplicates up front -- it is optimistic in that sense: it relies on constraint checking by the database, with the assumption that most inserts are not duplicates, so there's no point in wasting time trying to determine if they are.
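As a rough illustration of what the optimistic pattern can look like from the client side, the sketch below assumes the web_page table and BEFORE INSERT trigger defined later in this answer. The ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id) idiom is standard MySQL but is not part of the original answer; it is one way to get back the id of the row that already holds the URL instead of handling error 1062 in application code.

```sql
-- Sketch only: assumes the web_page table and its BEFORE INSERT trigger
-- from this answer already exist.

-- Attempt the insert without checking first. If url_hash collides with an
-- existing row, the no-op update makes LAST_INSERT_ID() return that row's id.
INSERT INTO web_page (url)
VALUES ('http://example.com/')
ON DUPLICATE KEY UPDATE id = LAST_INSERT_ID(id);

-- Either the id of the newly inserted row or the id of the existing duplicate.
SELECT LAST_INSERT_ID() AS web_page_id;
```

With InnoDB, a rejected insert attempt may still consume an AUTO_INCREMENT value, so ids can have gaps when duplicates are frequent.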
Working, tested example (5.7.16, backwards compatible to 5.6; previous versions do not have a built-in TO_BASE64() function):
CREATE TABLE web_page (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  url LONGTEXT NOT NULL,
  url_hash CHAR(24) COLLATE ascii_bin,
  PRIMARY KEY(id),
  UNIQUE KEY(url_hash),
  KEY(url(16))
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED;
Note that I am storing the base64 version of the hash. This costs 4:3 in size compared to storing it in binary form, but it makes the table contents and the error message human readable, and the inefficiency is partially offset by the table compression. The hash column has a unique constraint. The data type is CHAR, not VARCHAR, since this eliminates the byte needed to store the size -- the hash is always a fixed size. The column uses the ascii character set with ascii_bin (case-sensitive) collation, keeping the column and the unique index as small as practical.
The url_hash is set by a trigger, below, but the trigger does not check for a collision -- there is no need to check, because of the unique constraint on url_hash. The database will block a duplicate insert.
Note that url_hash should have been declared NOT NULL, but MySQL incorrectly enforces this before the BEFORE INSERT trigger fires, instead of after, so we are limited by that. The trigger does prevent it from being null.
The url column has a prefix index length of 16, which was chosen arbitrarily. This isn't a unique constraint, just an index for lookups, and it is probably shorter than you might want it to be, but its length has no operational impact on the problem we are solving here.
Here's the trigger to set the url_hash. We don't need to include this value in an INSERT statement when we insert rows.
DELIMITER $$
DROP TRIGGER IF EXISTS web_page_bi $$
CREATE TRIGGER web_page_bi BEFORE INSERT ON web_page FOR EACH ROW
BEGIN
  SET NEW.url_hash = TO_BASE64(UNHEX(MD5(NEW.url)));
END $$
DELIMITER ;
You need a trigger on update also, either to block updates if the table is supposed to be immutable, or to update the hash if the URL changes. We also need this trigger to ensure that the url_hash column can't be inappropriately set to NULL, since the limitation in MySQL doesn't allow us to declare it NOT NULL, as we should.
Now, to test.
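The answer describes the update trigger but does not show it. A minimal sketch of the second option (recompute the hash whenever a row is updated) might look like this; the trigger name web_page_bu is an assumption that simply mirrors web_page_bi.

```sql
DELIMITER $$
DROP TRIGGER IF EXISTS web_page_bu $$
-- web_page_bu is a hypothetical name, chosen to mirror web_page_bi.
CREATE TRIGGER web_page_bu BEFORE UPDATE ON web_page FOR EACH ROW
BEGIN
  -- Always derive the hash from the new url, so an UPDATE can neither
  -- leave a stale hash behind nor set url_hash to NULL.
  SET NEW.url_hash = TO_BASE64(UNHEX(MD5(NEW.url)));
END $$
DELIMITER ;
```

If the table is meant to be immutable, the trigger body could instead raise an error with SIGNAL SQLSTATE '45000'. The test session that follows exercises only the insert path.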
mysql> INSERT INTO web_page (url) VALUES ('http://example.com/');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM web_page;
+----+---------------------+--------------------------+
| id | url                 | url_hash                 |
+----+---------------------+--------------------------+
|  1 | http://example.com/ | pr8XV//wV/JmtpffnPF2/Q== |
+----+---------------------+--------------------------+
1 row in set (0.00 sec)
So far, so good. Now, a different URL:
mysql> INSERT INTO web_page (url) VALUES ('http://example.net/');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM web_page;
+----+---------------------+--------------------------+
| id | url                 | url_hash                 |
+----+---------------------+--------------------------+
|  1 | http://example.com/ | pr8XV//wV/JmtpffnPF2/Q== |
|  2 | http://example.net/ | ZVk/eLfvBI6tHN0Luj3NnQ== |
+----+---------------------+--------------------------+
2 rows in set (0.00 sec)
Still works. Now, a duplicate.
mysql> INSERT INTO web_page (url) VALUES ('http://example.com/');
ERROR 1062 (23000): Duplicate entry 'pr8XV//wV/JmtpffnPF2/Q==' for key 'url_hash'
Perfect. If you want an even lower risk of hash collisions than MD5 provides, use a SHA variant, increasing the length of url_hash to CHAR_LENGTH(TO_BASE64(UNHEX( /* your hash function */ ))) to accommodate the values generated by the hash algorithm in use.
Code Snippets
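For the SHA variant mentioned at the end of the solution, one possible sketch uses SHA2() with a 256-bit digest. A 32-byte digest encodes to 44 base64 characters, so the column is widened to CHAR(44); the choice of SHA-256 is an assumption, not something the answer prescribes.

```sql
-- Confirm the encoded length for the chosen hash (44 for SHA-256).
SELECT CHAR_LENGTH(TO_BASE64(UNHEX(SHA2('', 256)))) AS hash_len;

-- Widen the column to fit the longer hash.
ALTER TABLE web_page MODIFY url_hash CHAR(44) COLLATE ascii_bin;

-- Replace the insert trigger so new rows get the SHA-256 based hash.
DELIMITER $$
DROP TRIGGER IF EXISTS web_page_bi $$
CREATE TRIGGER web_page_bi BEFORE INSERT ON web_page FOR EACH ROW
BEGIN
  SET NEW.url_hash = TO_BASE64(UNHEX(SHA2(NEW.url, 256)));
END $$
DELIMITER ;

-- Existing rows still hold MD5-based values; recompute them. Any
-- BEFORE UPDATE trigger on the table must be switched to SHA2() first,
-- otherwise it would overwrite these values again.
UPDATE web_page SET url_hash = TO_BASE64(UNHEX(SHA2(url, 256)));
```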
CREATE TABLE web_page (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  url LONGTEXT NOT NULL,
  url_hash CHAR(24) COLLATE ascii_bin,
  PRIMARY KEY(id),
  UNIQUE KEY(url_hash),
  KEY(url(16))
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED;

DELIMITER $$
DROP TRIGGER IF EXISTS web_page_bi $$
CREATE TRIGGER web_page_bi BEFORE INSERT ON web_page FOR EACH ROW
BEGIN
  SET NEW.url_hash = TO_BASE64(UNHEX(MD5(NEW.url)));
END $$
DELIMITER ;

mysql> INSERT INTO web_page (url) VALUES ('http://example.com/');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM web_page;
+----+---------------------+--------------------------+
| id | url                 | url_hash                 |
+----+---------------------+--------------------------+
|  1 | http://example.com/ | pr8XV//wV/JmtpffnPF2/Q== |
+----+---------------------+--------------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO web_page (url) VALUES ('http://example.net/');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM web_page;
+----+---------------------+--------------------------+
| id | url                 | url_hash                 |
+----+---------------------+--------------------------+
|  1 | http://example.com/ | pr8XV//wV/JmtpffnPF2/Q== |
|  2 | http://example.net/ | ZVk/eLfvBI6tHN0Luj3NnQ== |
+----+---------------------+--------------------------+
2 rows in set (0.00 sec)

mysql> INSERT INTO web_page (url) VALUES ('http://example.com/');
ERROR 1062 (23000): Duplicate entry 'pr8XV//wV/JmtpffnPF2/Q==' for key 'url_hash'
Context
StackExchange Database Administrators Q#157647, answer score: 5