More efficient method to check for duplicate database entries?

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I am saving company information in a database, and if repeat information is found, I want to update the overlapping rows. Some of the supported fields are emails, phone numbers, organization names, and latitude/longitude positions. One of the obvious issues I've had to deal with, however, is knowing whether an organization is a duplicate of one already saved. Consequently, I came up with some basic combinations of information that, taken together, distinguish one company from another:

phone number + organization name

organization name + location

email hash + website hash

phone number + email hash

phone number + website hash

email hash + location

email hash + organization name

website hash + location

website hash + organization name
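As a rough illustration of how the combinations above could be checked against incoming data, here is a minimal sketch. The field names (`phone`, `email_hash`, and so on) and the `satisfied_combinations` helper are assumptions made for this example, not names from the original code. Python is used only to keep the sketch short and runnable; the same logic ports directly to PHP.

```python
# Hypothetical field names; each tuple is one identifying combination from the list above.
COMBINATIONS = [
    ("phone", "organization_name"),
    ("organization_name", "location"),
    ("email_hash", "website_hash"),
    ("phone", "email_hash"),
    ("phone", "website_hash"),
    ("email_hash", "location"),
    ("email_hash", "organization_name"),
    ("website_hash", "location"),
    ("website_hash", "organization_name"),
]


def satisfied_combinations(record):
    """Return the combinations whose fields are all present and non-empty in record."""
    return [
        combo for combo in COMBINATIONS
        if all(record.get(field) for field in combo)
    ]


# Example record: only the phone number and organization name are usable.
incoming = {"phone": "6316870100", "organization_name": "ORGANIZATION NAME", "email_hash": None}
print(satisfied_combinations(incoming))
```

Only the combinations returned here would need a duplicate-check query at all, which keeps the number of statements proportional to the data actually supplied.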

My workflow, as of now, is as follows:

Determine if any of the above unique combinations are satisfied by the new information

Loop through each combination and create a MySQL query to check the database for duplicate entries

Use PHP's multi_query to run semicolon-separated SELECT statements for the above queries. These queries look like this:

```
SELECT id, jaro_winkler_similarity(normalized, "ORGANIZATION_NAME") AS organization_name_similarity0
FROM profiles
LEFT JOIN phones USING(id)
LEFT JOIN (SELECT id, normalized FROM organization_names WHERE normalized LIKE "%ORGANIZATION%" OR normalized LIKE "%NAME%") AS likeMatches USING(id)
WHERE (number="6316870100" OR number="123")
HAVING organization_name_similarity0 > 85;

SELECT id, jaro_winkler_similarity(normalized, "ORGANIZATION_NAME") AS organization_name_similarity0
FROM profiles
LEFT JOIN (SELECT id, normalized FROM organization_names WHERE normalized LIKE "%ORGANIZATION%" OR normalized LIKE "%NAME%") AS likeMatches USING(id)
LEFT JOIN locations USING(id)
WHERE normalized_house_number="110" AND normalized_street_name="WASHINGTON" AND zip="00501"
HAVING organization_name_similarity0 > 85;

SELECT id, jaro_winkler_similarity(normalized, "ORGANIZATION_NAME")
```

Solution

I would enforce uniqueness through unique indexes in the database table itself, not through making and evaluating a series of SELECT statements.

This allows you to GREATLY simplify your UPSERT operation to a single query. That might look something like this:

/* Assume for this example there is a unique index on field_a and field_b */
INSERT INTO table (field_a, field_b, field_c) VALUES (1, 2, 3)
ON DUPLICATE KEY UPDATE field_c = 3
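To see the index-driven upsert run end to end, here is a small sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in so the example is self-contained: its `ON CONFLICT ... DO UPDATE` clause plays the role of MySQL's `ON DUPLICATE KEY UPDATE` here, and the table name `example`, the index name `uniq_a_b`, and the `upsert` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (field_a INTEGER, field_b INTEGER, field_c INTEGER)")
# The unique index is what defines a "duplicate": same field_a and field_b.
conn.execute("CREATE UNIQUE INDEX uniq_a_b ON example (field_a, field_b)")


def upsert(a, b, c):
    """Insert a row, or update field_c when (field_a, field_b) already exists."""
    conn.execute(
        """
        INSERT INTO example (field_a, field_b, field_c) VALUES (?, ?, ?)
        ON CONFLICT (field_a, field_b) DO UPDATE SET field_c = excluded.field_c
        """,
        (a, b, c),
    )


upsert(1, 2, 3)  # no existing row: inserts
upsert(1, 2, 9)  # same (field_a, field_b): updates field_c instead of inserting
rows = conn.execute("SELECT field_a, field_b, field_c FROM example").fetchall()
print(rows)
```

The point of the sketch is that no preliminary SELECT is needed: the database decides between insert and update from the index alone, and parameters are bound rather than concatenated into the statement.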

Now, the challenge here is that if your table has multiple unique indexes, you can get some unexpected behavior: when a new row collides with different existing rows on different indexes, MySQL updates only one of them. So you probably need to reconsider your uniqueness criteria. Why would you check uniqueness of hashes of field combinations? This makes NO sense.

Think in REAL WORLD terms about what makes a company listing unique. This probably doesn't include things like links to one or more websites on a related website table, one or more phone numbers on a related phone number table, or one or more emails on a related email table. All the fields you need to determine uniqueness of a company should exist on the company table, so you can place a single unique index across them.
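The trouble with several unique indexes can be made concrete with a small sketch, again using Python's `sqlite3` as a self-contained stand-in for MySQL. The `company` table, its columns, and the sample rows are invented for illustration. The incoming record shares a phone number with one stored company and an email with another, so "the duplicate" is not a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two independent unique constraints: one on phone, one on email.
conn.execute(
    "CREATE TABLE company (id INTEGER PRIMARY KEY, phone TEXT UNIQUE, email TEXT UNIQUE, name TEXT)"
)
conn.executemany(
    "INSERT INTO company (phone, email, name) VALUES (?, ?, ?)",
    [("111", "a@example.com", "Acme"), ("222", "b@example.com", "Bolt")],
)

# Hypothetical incoming record: Acme's phone number, Bolt's email address.
incoming = {"phone": "111", "email": "b@example.com"}
matches = conn.execute(
    "SELECT id, name FROM company WHERE phone = :phone OR email = :email ORDER BY id",
    incoming,
).fetchall()
print(matches)
```

Both rows come back as candidates, so an upsert against this table has no single obvious target. With one unique index across the identifying columns, an incoming row can match at most one existing row.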

Code Snippets

/* Assume for this example there is a unique index on field_a and field_b */
INSERT INTO table (field_a, field_b, field_c) VALUES (1, 2, 3)
ON DUPLICATE KEY UPDATE field_c = 3

Context

StackExchange Code Review Q#132564, answer score: 8
