More efficient method to check for duplicate database entries?
Problem
I am saving company information in a database, and if repeat information is found, I want to update the overlapping rows. Some of the supported fields are emails, phone numbers, organization names, and latitude/longitude positions. One of the obvious issues I've had to deal with, however, is knowing whether or not an organization is a duplicate of an already saved one. Consequently, I came up with some basic combinations of unique information that distinguish one company from another:
- phone number + organization_names
- organization_names + location
- email hash + website hash
- phone + email hash
- phone + website hash
- email hash + location
- email hash + organization name
- website hash + location
- website hash + organization_name

My workflow, as of now, is as follows:
- Determine if any of the above unique combinations are satisfied by the new information
- Loop through each combination and create a MySQL query to check the database for duplicate entries
- Use PHP's multi_query to run semicolon-separated SELECT statements for the above queries. These queries look like this:

```
SELECT id, jaro_winkler_similarity(normalized, "ORGANIZATION_NAME") AS organization_name_similarity0 FROM profiles LEFT JOIN phones USING(id) LEFT JOIN (SELECT id, normalized FROM organization_names WHERE normalized LIKE "%ORGANIZATION%" OR normalized LIKE "%NAME%") AS likeMatches USING(id) WHERE (number="6316870100" OR number="123") HAVING organization_name_similarity0 > 85;
SELECT id, jaro_winkler_similarity(normalized, "ORGANIZATION_NAME") AS organization_name_similarity0 FROM profiles LEFT JOIN (SELECT id, normalized FROM organization_names WHERE normalized LIKE "%ORGANIZATION%" OR normalized LIKE "%NAME%") AS likeMatches USING(id) LEFT JOIN locations USING(id) WHERE normalized_house_number="110" AND normalized_street_name="WASHINGTON" AND zip="00501" HAVING organization_name_similarity0 > 85;
SELECT id, jaro_winkler_similarity(normalized, "ORGANIZATION_NAME")
```
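As a rough illustration of the first workflow step, the sketch below checks which of the listed combinations a new record can satisfy. The field names (`phone`, `email_hash`, and so on) and the `satisfiedCombinations` helper are hypothetical and are not taken from the actual code base.

```php
<?php
// Hypothetical field names; the pairs mirror the combinations listed above.
const UNIQUE_COMBINATIONS = [
    ['phone', 'organization_name'],
    ['organization_name', 'location'],
    ['email_hash', 'website_hash'],
    ['phone', 'email_hash'],
    ['phone', 'website_hash'],
    ['email_hash', 'location'],
    ['email_hash', 'organization_name'],
    ['website_hash', 'location'],
    ['website_hash', 'organization_name'],
];

/**
 * Return the combinations for which the new record supplies every field.
 */
function satisfiedCombinations(array $record): array
{
    $satisfied = [];
    foreach (UNIQUE_COMBINATIONS as $fields) {
        $complete = true;
        foreach ($fields as $field) {
            // Treat missing, null, and empty-string values as "not supplied".
            if (!isset($record[$field]) || $record[$field] === '') {
                $complete = false;
                break;
            }
        }
        if ($complete) {
            $satisfied[] = $fields;
        }
    }
    return $satisfied;
}

// Example record: only a phone number and an organization name are known.
$newRecord = [
    'phone' => '6316870100',
    'organization_name' => 'ORGANIZATION NAME',
    'email_hash' => null,
];
$combos = satisfiedCombinations($newRecord);
```

Only the combinations returned here would need a duplicate-check query, so a record with few populated fields produces few queries.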
Solution
I would enforce uniqueness through unique indexes in the database table itself, not through making and evaluating a series of SELECT statements.
This allows you to GREATLY simplify your UPSERT operation to a single query. That might look something like:
/* Assume for this example there is a unique index on field_a and field_b */
INSERT INTO table (field_a, field_b, field_c) VALUES (1, 2, 3)
ON DUPLICATE KEY UPDATE field_c = 3

Now, the challenge here is that if your table has multiple unique indexes, you can get some unexpected behavior: when the new row collides with different existing rows on different indexes, only one of those rows is updated. So you probably need to reconsider your uniqueness criteria. Why would you check uniqueness of hashes of field combinations? This makes NO sense. Think in REAL WORLD terms about what makes a company listing unique. This probably doesn't include things like links to one or more websites on a related website table, or one or more phone numbers on a related phone number table, or one or more emails on a related email table, etc. All the fields you need to determine uniqueness of a company should exist on the company table, so you can place a single unique index across them.
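Since the question runs its queries from PHP, here is a minimal sketch of how such an upsert statement could be assembled with placeholders rather than inlined values. The `buildUpsert` helper and the `companies` table name are assumptions made for illustration. The sketch uses MySQL's `VALUES(col)` form inside `ON DUPLICATE KEY UPDATE`, which is deprecated as of MySQL 8.0.20 in favor of row aliases but is still accepted.

```php
<?php
/**
 * Build a MySQL upsert statement with positional placeholders.
 *
 * $table and the column names must come from trusted code, never from user
 * input, because identifiers cannot be bound as parameters.
 */
function buildUpsert(string $table, array $columns, array $updateColumns): string
{
    // Backtick-quote an identifier, doubling any embedded backticks.
    $quote = fn(string $name): string => '`' . str_replace('`', '``', $name) . '`';

    $columnList = implode(', ', array_map($quote, $columns));
    $placeholders = implode(', ', array_fill(0, count($columns), '?'));

    // VALUES(col) refers to the value the INSERT would have stored.
    $updates = implode(', ', array_map(
        fn(string $c): string => $quote($c) . ' = VALUES(' . $quote($c) . ')',
        $updateColumns
    ));

    return 'INSERT INTO ' . $quote($table)
        . " ($columnList) VALUES ($placeholders)"
        . " ON DUPLICATE KEY UPDATE $updates";
}

// Assumes a unique index on (field_a, field_b), as in the example above.
$sql = buildUpsert('companies', ['field_a', 'field_b', 'field_c'], ['field_c']);
```

With an open PDO connection in `$pdo` (connection details omitted), the statement could then be run as `$pdo->prepare($sql)->execute([1, 2, 3]);`, which keeps the values out of the SQL text.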
Code Snippets
/* Assume for this example there is a unique index on field_a and field_b */
INSERT INTO table (field_a, field_b, field_c) VALUES (1, 2, 3)
ON DUPLICATE KEY UPDATE field_c = 3
Context
StackExchange Code Review Q#132564, answer score: 8