HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Separating Duplicates from Uniques

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
uniquesfromseparatingduplicates

Problem

I have a MySQL table with a very large data. What I need is to find and separate duplicates from the unique rows

Let's say this is my table:

indx, data_lname, data_fname, data_mname, data_dob, data_mobile


What I'm doing right now is selecting all the rows in one query and then comparing each row with the same table using a different query. This works ok but slow as hell.

Can this be done with a single query?

```
private sub poplist()
DBstrSQL = "SELECT * from tbl_data order by indx asc"
Dim myCmd As New MySqlCommand
myCmd.CommandTimeout = 300
myCmd.CommandText = DBstrSQL
myCmd.Connection = MySqlConn
Dim myReader As MySqlDataReader = myCmd.ExecuteReader()
If myReader.HasRows = True Then
While myReader.Read()
if checkifdup(myReader.GetString("indx"),myReader.GetString("data_fname"),myReader.GetString("data_mname"),myReader.GetString("data_lname"),myReader.GetString("data_dob"),myReader.GetString("data_mobile"))=false then
With lstUnique.Items.Add(myreader.getstring("data_lname"))
.SubItems.Add(myreader.getstring("data_fname"))
.SubItems.Add(myreader.getstring("data_mname"))
.SubItems.Add(myreader.getstring("data_dob"))
.SubItems.Add(myreader.getstring("data_mobile"))
End With
else
With lstDup.Items.Add(myreader.getstring("data_lname"))
.SubItems.Add(myreader.getstring("data_fname"))
.SubItems.Add(myreader.getstring("data_mname"))
.SubItems.Add(myreader.getstring("data_dob"))
.SubItems.Add(myreader.getstring("data_mobile"))
End With
end if
End While
end if
myReader.Close()
end sub

private function checkifdup(dataindx sa string, data1 as string,da

Solution

You should never run a query in a loop, especially a loop where the number of queries issued scales with the size of the data. There is almost always a way to formulate the SQL such that you get the results you want with a small, fixed number of queries.

In your case, you want two queries: one to find rows that are unique (ignoring the indx column), and another to find the rows that appear more than once (ignoring the indx column). You could formulate those queries as:

SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
    FROM tbl_data
    GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
    HAVING COUNT(indx) = 1
    ORDER BY indx;

SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
    FROM tbl_data
    GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
    HAVING COUNT(indx) > 1
    ORDER BY indx;


Use those queries to populate lstUnique and lstDup, respectively.

Edit: I read the code carelessly, and misinterpreted the criteria for considering two records to be "duplicates". I would reformulate the query close to the way your VB code worked.

To find the unique records:

SELECT *
    FROM tbl_data AS a
    WHERE NOT EXISTS (
        SELECT indx
            FROM tbl_data AS b
            WHERE
               a.indx <> b.indx
               AND a.data_lname = b.data_lname
               AND a.data_fname = b.data_fname
               AND a.data_mname = b.data_mname
               AND (a.data_dob = b.data_dob OR a.data_mobile = b.data_mobile)
    );


To find the records with duplicates, change WHERE NOT EXISTS to WHERE EXISTS.

This assumes that none of the fields can have a NULL value.

For performance, be sure that indexes exist on the table. I assume that indx, being the primary key, already has a UNIQUE INDEX. In addition, there should be an index on (data_lname, data_fname, data_mname).

Code Snippets

SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
    FROM tbl_data
    GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
    HAVING COUNT(indx) = 1
    ORDER BY indx;

SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
    FROM tbl_data
    GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
    HAVING COUNT(indx) > 1
    ORDER BY indx;
SELECT *
    FROM tbl_data AS a
    WHERE NOT EXISTS (
        SELECT indx
            FROM tbl_data AS b
            WHERE
               a.indx <> b.indx
               AND a.data_lname = b.data_lname
               AND a.data_fname = b.data_fname
               AND a.data_mname = b.data_mname
               AND (a.data_dob = b.data_dob OR a.data_mobile = b.data_mobile)
    );

Context

StackExchange Code Review Q#46886, answer score: 2

Revisions (0)

No revisions yet.