patternsqlMinor
Running a query on an improperly indexed, 600gb table
Viewed 0 times
tablequery600gbindexedrunningimproperly
Problem
This is my first post on here and I'm really baffled by this. I have joined a team and somewhere in the past a horrific table was created which has been bloated to a point of 600+gb of data.
Originally this data was supposed to come in for one state, but data from the entire country has been storing here for a long period of time.
I have been put in charge of running a query to isolate a
To make things even worse, the database is located on an old server which runs on a quad p3 processor with only about 4gb of ram so it's incredibly slow.
Query I would like to execute, but for obvious reasons cringe just looking at:
Table structure:
``
Originally this data was supposed to come in for one state, but data from the entire country has been storing here for a long period of time.
I have been put in charge of running a query to isolate a
LATITUDE/LONGITUDE bounding box in this momentous data set within a certain date range.To make things even worse, the database is located on an old server which runs on a quad p3 processor with only about 4gb of ram so it's incredibly slow.
Query I would like to execute, but for obvious reasons cringe just looking at:
INSERT INTO gpsdata.vehicle_location_wa SELECT * FROM gpsdata.vehicle_location
WHERE LOCATION_TIMESTAMP BETWEEN '2014-03-20' AND '2014-07-20'
AND LATITUDE BETWEEN 46.86275036 AND 47.80149551
AND LONGITUDE BETWEEN -122.44599707 AND -116.94458691;Table structure:
``
CREATE TABLE vehicle_location (
SEQ_ID decimal(10 , 0 ) default NULL,
DEVICE_ID varchar(100) collate utf8_unicode_ci default NULL,
DATA_TYPE varchar(80) collate utf8_unicode_ci default NULL,
DATA_DESC varchar(180) collate utf8_unicode_ci default NULL,
SPEED decimal(6 , 3 ) default '0.000',
SPEED_UOM char(3) collate utf8_unicode_ci default NULL,
DIRECTION decimal(3 , 0 ) default NULL,
DURATION decimal(10 , 0 ) default '0',
LATITUDE decimal(16 , 13 ) default NULL,
LONGITUDE decimal(16 , 13 ) default NULL,
GPS_STATUS tinyint(4) default NULL,
LOCATION_TIMESTAMP datetime default NULL,
TIMEZONE tinyint(1) default NULL,
DST tinyint(1) default NULL,
STATUS tinyint(1) default NULL,
MILEAGE decimal(15 , 2 ) default '0.00',
QUEUE_TIME datetime default NULL,
OBSERVATION_TIME datetime default NULL,
ID BIGINT NOT NULL auto_increment,
PRIMARY KEY (ID),
KEY DEVICE_ID (DEVICE_ID),
KEY Location_Timestamp (LOCSolution
You have a big problem for one reason: BTree indexes (which are the only format available for InnoDB) are highly inefficient for filtering on more than 1 range.
There is one thing that you must understand: in general, using more than 1 index per table access is not possible/efficient (there are some cases where union_merge is faster, but that is an exception)- in general, for AND conditions you want a single index over several fields.
So, no problem then, isn't it? We just index
There is an exception to this, which is a new feature found in 5.6 called Index condition pushdown (that doesn't solve the problem completely, but makes the 3-column index more useful than a 1-column one).
So what are the options?
-
Convert some of the conditions to the
may actually be much faster. Of course, this depends on the selectivity of the clauses, etc.
Can we do the same for LONGITUDE. In same cases, yes, you could create an additional field called
The 3 columns of the indes will be used in this case. But to be frank, that starts to become a bit crazy.
-
Use the geospatial extensions. MySQL has support for spatial types like POINT, though precisely for geographical applications, and with its own indexes (R-trees), which can index in several dimensions at te same time -so they speed up multirange searches- and have its own, optimised set of bounding box functions (
This is great, isn't it? Bad news: it is only available for MyISAM in 5.5 and 5.6, and it will only be available for InnoDB in 5.7. The change is also not very transparent (not as easy as just add an index).
So, the fact that you have an unindexed table is possible due that your options are very narrow, specially for 5.5. If you are a bit flexible about the table structure (partitioning, as you suggested) or performing other tricks, there are ways to walkaround it, but it needs some care.
Some people end up combining MySQL with external tools precisely for this reason, and we will not have a real solution until 5.7.
Finally, there is the problem of the large table. This can become a problem. InnoDB is only efficient when most of its non-leaf parts of the primary key fits into memory. As ypercube says, a 3.5TB table on a 4GB server is something scary. Inserting to that table must be slow. You will need more ram and thinking about partitioning it, for example, as you said, by state (if that is ok for your queries) of by timestamp and get advantage of the manual or automatic prunning.
There is one thing that you must understand: in general, using more than 1 index per table access is not possible/efficient (there are some cases where union_merge is faster, but that is an exception)- in general, for AND conditions you want a single index over several fields.
So, no problem then, isn't it? We just index
(LOCATION_TIMESTAMP, LATITUDE, LONGITUDE) right? Sorry, no. As you are using a range for the conditions on all the columns, only the first column of the index will be used. The reason is easy to see, but difficult to explain without drawing a tree, so I hope you can trust me on this and I can later explain you why.There is an exception to this, which is a new feature found in 5.6 called Index condition pushdown (that doesn't solve the problem completely, but makes the 3-column index more useful than a 1-column one).
So what are the options?
-
Convert some of the conditions to the
ref type, that means, convert them into equality checks. That is something that can be done with the timestamps. It may sound a bit weird, but indexing (LOCATION_TIMESTAMP, LATITUDE) and having a query like this:WHERE LOCATION_TIMESTAMP IN ('2014-03-20', '2014-03-21', ..., '2014-07-20') AND
LATITUDE BETWEEN 46.86275036 AND 47.80149551;may actually be much faster. Of course, this depends on the selectivity of the clauses, etc.
Can we do the same for LONGITUDE. In same cases, yes, you could create an additional field called
LATITUDE_INT with floor(LATITUDE), and do something like:WHERE LOCATION_TIMESTAMP IN ('2014-03-20', '2014-03-21', ..., '2014-07-20') AND
LATITUDE_INT IN (46, 47) AND
LATITUDE BETWEEN 46.86275036 AND 47.80149551 AND
LONGITUDE BETWEEN -122.44599707 AND -116.94458691;The 3 columns of the indes will be used in this case. But to be frank, that starts to become a bit crazy.
-
Use the geospatial extensions. MySQL has support for spatial types like POINT, though precisely for geographical applications, and with its own indexes (R-trees), which can index in several dimensions at te same time -so they speed up multirange searches- and have its own, optimised set of bounding box functions (
MBRContains() is what you want).This is great, isn't it? Bad news: it is only available for MyISAM in 5.5 and 5.6, and it will only be available for InnoDB in 5.7. The change is also not very transparent (not as easy as just add an index).
So, the fact that you have an unindexed table is possible due that your options are very narrow, specially for 5.5. If you are a bit flexible about the table structure (partitioning, as you suggested) or performing other tricks, there are ways to walkaround it, but it needs some care.
Some people end up combining MySQL with external tools precisely for this reason, and we will not have a real solution until 5.7.
Finally, there is the problem of the large table. This can become a problem. InnoDB is only efficient when most of its non-leaf parts of the primary key fits into memory. As ypercube says, a 3.5TB table on a 4GB server is something scary. Inserting to that table must be slow. You will need more ram and thinking about partitioning it, for example, as you said, by state (if that is ok for your queries) of by timestamp and get advantage of the manual or automatic prunning.
Code Snippets
WHERE LOCATION_TIMESTAMP IN ('2014-03-20', '2014-03-21', ..., '2014-07-20') AND
LATITUDE BETWEEN 46.86275036 AND 47.80149551;WHERE LOCATION_TIMESTAMP IN ('2014-03-20', '2014-03-21', ..., '2014-07-20') AND
LATITUDE_INT IN (46, 47) AND
LATITUDE BETWEEN 46.86275036 AND 47.80149551 AND
LONGITUDE BETWEEN -122.44599707 AND -116.94458691;Context
StackExchange Database Administrators Q#76437, answer score: 2
Revisions (0)
No revisions yet.