patternsqlMinor
Separating Duplicates from Uniques
Viewed 0 times
uniquesfromseparatingduplicates
Problem
I have a MySQL table with a very large data. What I need is to find and separate duplicates from the unique rows
Let's say this is my table:
What I'm doing right now is selecting all the rows in one query and then comparing each row with the same table using a different query. This works ok but slow as hell.
Can this be done with a single query?
```
private sub poplist()
DBstrSQL = "SELECT * from tbl_data order by indx asc"
Dim myCmd As New MySqlCommand
myCmd.CommandTimeout = 300
myCmd.CommandText = DBstrSQL
myCmd.Connection = MySqlConn
Dim myReader As MySqlDataReader = myCmd.ExecuteReader()
If myReader.HasRows = True Then
While myReader.Read()
if checkifdup(myReader.GetString("indx"),myReader.GetString("data_fname"),myReader.GetString("data_mname"),myReader.GetString("data_lname"),myReader.GetString("data_dob"),myReader.GetString("data_mobile"))=false then
With lstUnique.Items.Add(myreader.getstring("data_lname"))
.SubItems.Add(myreader.getstring("data_fname"))
.SubItems.Add(myreader.getstring("data_mname"))
.SubItems.Add(myreader.getstring("data_dob"))
.SubItems.Add(myreader.getstring("data_mobile"))
End With
else
With lstDup.Items.Add(myreader.getstring("data_lname"))
.SubItems.Add(myreader.getstring("data_fname"))
.SubItems.Add(myreader.getstring("data_mname"))
.SubItems.Add(myreader.getstring("data_dob"))
.SubItems.Add(myreader.getstring("data_mobile"))
End With
end if
End While
end if
myReader.Close()
end sub
private function checkifdup(dataindx sa string, data1 as string,da
Let's say this is my table:
indx, data_lname, data_fname, data_mname, data_dob, data_mobileWhat I'm doing right now is selecting all the rows in one query and then comparing each row with the same table using a different query. This works ok but slow as hell.
Can this be done with a single query?
```
private sub poplist()
DBstrSQL = "SELECT * from tbl_data order by indx asc"
Dim myCmd As New MySqlCommand
myCmd.CommandTimeout = 300
myCmd.CommandText = DBstrSQL
myCmd.Connection = MySqlConn
Dim myReader As MySqlDataReader = myCmd.ExecuteReader()
If myReader.HasRows = True Then
While myReader.Read()
if checkifdup(myReader.GetString("indx"),myReader.GetString("data_fname"),myReader.GetString("data_mname"),myReader.GetString("data_lname"),myReader.GetString("data_dob"),myReader.GetString("data_mobile"))=false then
With lstUnique.Items.Add(myreader.getstring("data_lname"))
.SubItems.Add(myreader.getstring("data_fname"))
.SubItems.Add(myreader.getstring("data_mname"))
.SubItems.Add(myreader.getstring("data_dob"))
.SubItems.Add(myreader.getstring("data_mobile"))
End With
else
With lstDup.Items.Add(myreader.getstring("data_lname"))
.SubItems.Add(myreader.getstring("data_fname"))
.SubItems.Add(myreader.getstring("data_mname"))
.SubItems.Add(myreader.getstring("data_dob"))
.SubItems.Add(myreader.getstring("data_mobile"))
End With
end if
End While
end if
myReader.Close()
end sub
private function checkifdup(dataindx sa string, data1 as string,da
Solution
You should never run a query in a loop, especially a loop where the number of queries issued scales with the size of the data. There is almost always a way to formulate the SQL such that you get the results you want with a small, fixed number of queries.
In your case, you want two queries: one to find rows that are unique (ignoring the
Use those queries to populate
Edit: I read the code carelessly, and misinterpreted the criteria for considering two records to be "duplicates". I would reformulate the query close to the way your VB code worked.
To find the unique records:
To find the records with duplicates, change
This assumes that none of the fields can have a NULL value.
For performance, be sure that indexes exist on the table. I assume that
In your case, you want two queries: one to find rows that are unique (ignoring the
indx column), and another to find the rows that appear more than once (ignoring the indx column). You could formulate those queries as:SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
FROM tbl_data
GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
HAVING COUNT(indx) = 1
ORDER BY indx;
SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
FROM tbl_data
GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
HAVING COUNT(indx) > 1
ORDER BY indx;Use those queries to populate
lstUnique and lstDup, respectively.Edit: I read the code carelessly, and misinterpreted the criteria for considering two records to be "duplicates". I would reformulate the query close to the way your VB code worked.
To find the unique records:
SELECT *
FROM tbl_data AS a
WHERE NOT EXISTS (
SELECT indx
FROM tbl_data AS b
WHERE
a.indx <> b.indx
AND a.data_lname = b.data_lname
AND a.data_fname = b.data_fname
AND a.data_mname = b.data_mname
AND (a.data_dob = b.data_dob OR a.data_mobile = b.data_mobile)
);To find the records with duplicates, change
WHERE NOT EXISTS to WHERE EXISTS.This assumes that none of the fields can have a NULL value.
For performance, be sure that indexes exist on the table. I assume that
indx, being the primary key, already has a UNIQUE INDEX. In addition, there should be an index on (data_lname, data_fname, data_mname).Code Snippets
SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
FROM tbl_data
GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
HAVING COUNT(indx) = 1
ORDER BY indx;
SELECT data_lname, data_fname, data_mname, data_dob, data_mobile
FROM tbl_data
GROUP BY data_lname, data_fname, data_mname, data_dob, data_mobile
HAVING COUNT(indx) > 1
ORDER BY indx;SELECT *
FROM tbl_data AS a
WHERE NOT EXISTS (
SELECT indx
FROM tbl_data AS b
WHERE
a.indx <> b.indx
AND a.data_lname = b.data_lname
AND a.data_fname = b.data_fname
AND a.data_mname = b.data_mname
AND (a.data_dob = b.data_dob OR a.data_mobile = b.data_mobile)
);Context
StackExchange Code Review Q#46886, answer score: 2
Revisions (0)
No revisions yet.