patternsqlMinor
remove almost duplicates based on insert time
Viewed 0 times
insertalmosttimeremovebasedduplicates
Problem
Have a MySql table that records magnetic stripe swipe times from identity cards:
The unique key is there to prevent exact duplicate inserts.
However there is another source of almost duplicates where the card gets swiped twice in quick succession or even for some odd reason within a minute or so when there may be other cards swiped in between. In some ways this is like bounce on electronic switch contacts.
Here's a snippet of rows to illustrate:
What I want is a query that will remove entries like id 1663, that are within some specified time of rows with the same personalid, status=1, ride_taken=1 and source
Yes I have searched and inspected both here and on stackoverflow
such as https://stackoverflow.com/questions/18932/how-can-i-remove-duplicate-rows
for a similar situation but have not found an answer yet. If there is one already, please point me at it.
I'm at a loss how to apply a time range filter for the swipe_time to delete only similar rows within a specified time range.
CREATE TABLE `swipes` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`personalid` varchar(20) NOT NULL,
`swipe_time` datetime NOT NULL,
`status` tinyint(1) NOT NULL COMMENT '1=registered,0=not',
`ride_taken` tinyint(1) NOT NULL DEFAULT '0',
`source` tinyint(4) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `my_key` (`personalid`,`swipe_time`,`status`,`ride_taken`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8The unique key is there to prevent exact duplicate inserts.
However there is another source of almost duplicates where the card gets swiped twice in quick succession or even for some odd reason within a minute or so when there may be other cards swiped in between. In some ways this is like bounce on electronic switch contacts.
Here's a snippet of rows to illustrate:
id personalid swipe_time status ride_taken source
1661 C08877547 2012-10-21 01:12:08 1 1 3
1662 C09364782 2012-10-21 01:23:38 1 1 3
1663 C09364782 2012-10-21 01:23:48 1 1 3
1664 D09490557 2012-10-21 01:24:39 1 1 3What I want is a query that will remove entries like id 1663, that are within some specified time of rows with the same personalid, status=1, ride_taken=1 and source
Yes I have searched and inspected both here and on stackoverflow
such as https://stackoverflow.com/questions/18932/how-can-i-remove-duplicate-rows
for a similar situation but have not found an answer yet. If there is one already, please point me at it.
I'm at a loss how to apply a time range filter for the swipe_time to delete only similar rows within a specified time range.
Solution
Here is something very quick and dirty
Get the minimum datetime for each personalid, status, ride_taken
First let's load your sample data:
Here is the output of my proposed query
Give it a Try !!!
SUGGESTION #1
To accommodate the query better, add this index
SUGGESTION #2 (Optional)
To accommodate more than 255 swipe stations, perhaps
SUGGESTION #3 (Optional)
Some Transit Systems allow multiple swipes (up to 4 for PATH TransHudson) in one station. You could give some additional thought on this should you have to allow multiple swipes.
UPDATE 2012-10-24 17:30 EDT
If you are trying to limit within a time range, I have another viewpoint for you
Here is the subquery within my answer:
Let's say you want to limit to the Latest Entry Per 10 seconds. You will first have to generate a ten-second interval boundary like this:
```
mysql> SELECT personalid,status,ride_taken,swipe_time,
-> (swipe_time - INTERVAL MOD(UNIX_TIMESTAMP(swipe_time),10) SECOND) ten_sec_in
-> FROM swipes;
+------------+--------+------------+---------------------+---------------------+
| personalid | status | ride_taken | swipe_time | ten_sec_int |
+------------+--------+------------+---------------------+---------------------+
| C08877547 | 1 | 1 | 2012-10-21 01:12:08 | 2012-10-21 01:12:00 |
| C09364782 | 1 | 1 | 2012-10-21 01:23:38 | 2012-10-21 01:23:30 |
| C09364782 | 1 | 1 | 2012-10-21 01:23:48 | 2012-10-21 01:23:40 |
| D09490557 | 1 | 1 | 2012-10
Get the minimum datetime for each personalid, status, ride_taken
SELECT B.* FROM
(SELECT personalid,status,ride_taken,MIN(swipe_time) swipe_time
FROM swipes GROUP BY personalid,status,ride_taken) A
INNER JOIN swipes B USING (personalid,status,ride_taken,swipe_time);First let's load your sample data:
mysql> drop database if exists sdujan;
Query OK, 1 row affected (0.00 sec)
mysql> create database sdujan;
Query OK, 1 row affected (0.00 sec)
mysql> use sdujan;
Database changed
mysql> CREATE TABLE `swipes` (
-> `id` int(11) NOT NULL AUTO_INCREMENT,
-> `personalid` varchar(20) NOT NULL,
-> `swipe_time` datetime NOT NULL,
-> `status` tinyint(1) NOT NULL COMMENT '1=registered,0=not',
-> `ride_taken` tinyint(1) NOT NULL DEFAULT '0',
-> `source` tinyint(4) NOT NULL DEFAULT '0',
-> PRIMARY KEY (`id`),
-> UNIQUE KEY `my_key` (`personalid`,`swipe_time`,`status`,`ride_taken`)
-> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.09 sec)
mysql> insert into swipes values
-> (1661,'C08877547','2012-10-21 01:12:08',1,1,3),
-> (1662,'C09364782','2012-10-21 01:23:38',1,1,3),
-> (1663,'C09364782','2012-10-21 01:23:48',1,1,3),
-> (1664,'D09490557','2012-10-21 01:24:39',1,1,3);
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> select * from swipes;
+------+------------+---------------------+--------+------------+--------+
| id | personalid | swipe_time | status | ride_taken | source |
+------+------------+---------------------+--------+------------+--------+
| 1661 | C08877547 | 2012-10-21 01:12:08 | 1 | 1 | 3 |
| 1662 | C09364782 | 2012-10-21 01:23:38 | 1 | 1 | 3 |
| 1663 | C09364782 | 2012-10-21 01:23:48 | 1 | 1 | 3 |
| 1664 | D09490557 | 2012-10-21 01:24:39 | 1 | 1 | 3 |
+------+------------+---------------------+--------+------------+--------+
4 rows in set (0.00 sec)
mysql>Here is the output of my proposed query
mysql> SELECT B.* FROM
-> (SELECT personalid,status,ride_taken,MIN(swipe_time) swipe_time
-> FROM swipes GROUP BY personalid,status,ride_taken) A
-> INNER JOIN swipes B USING (personalid,status,ride_taken,swipe_time);
+------+------------+---------------------+--------+------------+--------+
| id | personalid | swipe_time | status | ride_taken | source |
+------+------------+---------------------+--------+------------+--------+
| 1661 | C08877547 | 2012-10-21 01:12:08 | 1 | 1 | 3 |
| 1662 | C09364782 | 2012-10-21 01:23:38 | 1 | 1 | 3 |
| 1664 | D09490557 | 2012-10-21 01:24:39 | 1 | 1 | 3 |
+------+------------+---------------------+--------+------------+--------+
3 rows in set (0.00 sec)
mysql>Give it a Try !!!
SUGGESTION #1
To accommodate the query better, add this index
ALTER TABLE swipes ADD UNIQUE KEY `my_key2`
(`personalid`,`status`,`ride_taken`,`swipe_time`);SUGGESTION #2 (Optional)
To accommodate more than 255 swipe stations, perhaps
source should be smallint unsignedSUGGESTION #3 (Optional)
Some Transit Systems allow multiple swipes (up to 4 for PATH TransHudson) in one station. You could give some additional thought on this should you have to allow multiple swipes.
UPDATE 2012-10-24 17:30 EDT
If you are trying to limit within a time range, I have another viewpoint for you
Here is the subquery within my answer:
mysql> SELECT
-> personalid,status,ride_taken,MIN(swipe_time) swipe_time
-> FROM
-> swipes GROUP BY personalid,status,ride_taken
-> ;
+------------+--------+------------+---------------------+
| personalid | status | ride_taken | swipe_time |
+------------+--------+------------+---------------------+
| C08877547 | 1 | 1 | 2012-10-21 01:12:08 |
| C09364782 | 1 | 1 | 2012-10-21 01:23:38 |
| D09490557 | 1 | 1 | 2012-10-21 01:24:39 |
+------------+--------+------------+---------------------+
3 rows in set (0.00 sec)
mysql>Let's say you want to limit to the Latest Entry Per 10 seconds. You will first have to generate a ten-second interval boundary like this:
```
mysql> SELECT personalid,status,ride_taken,swipe_time,
-> (swipe_time - INTERVAL MOD(UNIX_TIMESTAMP(swipe_time),10) SECOND) ten_sec_in
-> FROM swipes;
+------------+--------+------------+---------------------+---------------------+
| personalid | status | ride_taken | swipe_time | ten_sec_int |
+------------+--------+------------+---------------------+---------------------+
| C08877547 | 1 | 1 | 2012-10-21 01:12:08 | 2012-10-21 01:12:00 |
| C09364782 | 1 | 1 | 2012-10-21 01:23:38 | 2012-10-21 01:23:30 |
| C09364782 | 1 | 1 | 2012-10-21 01:23:48 | 2012-10-21 01:23:40 |
| D09490557 | 1 | 1 | 2012-10
Code Snippets
SELECT B.* FROM
(SELECT personalid,status,ride_taken,MIN(swipe_time) swipe_time
FROM swipes GROUP BY personalid,status,ride_taken) A
INNER JOIN swipes B USING (personalid,status,ride_taken,swipe_time);mysql> drop database if exists sdujan;
Query OK, 1 row affected (0.00 sec)
mysql> create database sdujan;
Query OK, 1 row affected (0.00 sec)
mysql> use sdujan;
Database changed
mysql> CREATE TABLE `swipes` (
-> `id` int(11) NOT NULL AUTO_INCREMENT,
-> `personalid` varchar(20) NOT NULL,
-> `swipe_time` datetime NOT NULL,
-> `status` tinyint(1) NOT NULL COMMENT '1=registered,0=not',
-> `ride_taken` tinyint(1) NOT NULL DEFAULT '0',
-> `source` tinyint(4) NOT NULL DEFAULT '0',
-> PRIMARY KEY (`id`),
-> UNIQUE KEY `my_key` (`personalid`,`swipe_time`,`status`,`ride_taken`)
-> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.09 sec)
mysql> insert into swipes values
-> (1661,'C08877547','2012-10-21 01:12:08',1,1,3),
-> (1662,'C09364782','2012-10-21 01:23:38',1,1,3),
-> (1663,'C09364782','2012-10-21 01:23:48',1,1,3),
-> (1664,'D09490557','2012-10-21 01:24:39',1,1,3);
Query OK, 4 rows affected (0.00 sec)
Records: 4 Duplicates: 0 Warnings: 0
mysql> select * from swipes;
+------+------------+---------------------+--------+------------+--------+
| id | personalid | swipe_time | status | ride_taken | source |
+------+------------+---------------------+--------+------------+--------+
| 1661 | C08877547 | 2012-10-21 01:12:08 | 1 | 1 | 3 |
| 1662 | C09364782 | 2012-10-21 01:23:38 | 1 | 1 | 3 |
| 1663 | C09364782 | 2012-10-21 01:23:48 | 1 | 1 | 3 |
| 1664 | D09490557 | 2012-10-21 01:24:39 | 1 | 1 | 3 |
+------+------------+---------------------+--------+------------+--------+
4 rows in set (0.00 sec)
mysql>mysql> SELECT B.* FROM
-> (SELECT personalid,status,ride_taken,MIN(swipe_time) swipe_time
-> FROM swipes GROUP BY personalid,status,ride_taken) A
-> INNER JOIN swipes B USING (personalid,status,ride_taken,swipe_time);
+------+------------+---------------------+--------+------------+--------+
| id | personalid | swipe_time | status | ride_taken | source |
+------+------------+---------------------+--------+------------+--------+
| 1661 | C08877547 | 2012-10-21 01:12:08 | 1 | 1 | 3 |
| 1662 | C09364782 | 2012-10-21 01:23:38 | 1 | 1 | 3 |
| 1664 | D09490557 | 2012-10-21 01:24:39 | 1 | 1 | 3 |
+------+------------+---------------------+--------+------------+--------+
3 rows in set (0.00 sec)
mysql>ALTER TABLE swipes ADD UNIQUE KEY `my_key2`
(`personalid`,`status`,`ride_taken`,`swipe_time`);mysql> SELECT
-> personalid,status,ride_taken,MIN(swipe_time) swipe_time
-> FROM
-> swipes GROUP BY personalid,status,ride_taken
-> ;
+------------+--------+------------+---------------------+
| personalid | status | ride_taken | swipe_time |
+------------+--------+------------+---------------------+
| C08877547 | 1 | 1 | 2012-10-21 01:12:08 |
| C09364782 | 1 | 1 | 2012-10-21 01:23:38 |
| D09490557 | 1 | 1 | 2012-10-21 01:24:39 |
+------------+--------+------------+---------------------+
3 rows in set (0.00 sec)
mysql>Context
StackExchange Database Administrators Q#27554, answer score: 2
Revisions (0)
No revisions yet.