snippetsqlMinor
How to setup complex multi-column index for massive table
Viewed 0 times
multimassivecolumnsetupforhowindextablecomplex
Problem
I've searched what I could search on Google, SO, etc and haven't found an answer that would seem to fit what I'm looking for. I'm hesitant to make a post about this as I'm sure the answer is out there somewhere, I just haven't been able to find it :-\
I have setup a table in MySQL that I really could use some insight on. I have an existing index and it works pretty well in some cases, and others it takes 20+ seconds to run. Let me give you some background.
It's a MyISAM table with fixed rows (every column is an INT(11) ) and this table currently has 150,000,000 rows on it (10.2 GB total). I use this to track analytics run on the web software we use as other open source alternatives (like Piwik) were simply overkill and we needed direct access to the "stuff". That aside here's the basic structure of the table. (again all INT(11) as I read and understood that indexes with the same type and length are most effective).
And here's the following range of values for each column.
The query that I am trying to run is basically to get a distinct count of all the visitors that did a certain action for a location during a particular time. (in other words, must have location, action, and ts in the query)
I have a multi-column index on location, region, action, ts and this query can take 20 to 30 seconds to execute. I also have one other index that is simple visitor_id, ts, and that one doesn't show any issues I assume because visitor_id has such a high cardinality. EXPLAIN SELECT shows I'm hitting the index and it seems to do as well as it could do.
```
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table range visitor_id
I have setup a table in MySQL that I really could use some insight on. I have an existing index and it works pretty well in some cases, and others it takes 20+ seconds to run. Let me give you some background.
It's a MyISAM table with fixed rows (every column is an INT(11) ) and this table currently has 150,000,000 rows on it (10.2 GB total). I use this to track analytics run on the web software we use as other open source alternatives (like Piwik) were simply overkill and we needed direct access to the "stuff". That aside here's the basic structure of the table. (again all INT(11) as I read and understood that indexes with the same type and length are most effective).
id region_id location_id action_id visitor_id ts url_id employee_idAnd here's the following range of values for each column.
150M <20 <3000 <10 <40M unix_time <20k <200kThe query that I am trying to run is basically to get a distinct count of all the visitors that did a certain action for a location during a particular time. (in other words, must have location, action, and ts in the query)
SELECT COUNT(DISTINCT visitor_id) FROM table WHERE location_id = # AND region_id = # AND action_id = # AND ts BETWEEN x AND yI have a multi-column index on location, region, action, ts and this query can take 20 to 30 seconds to execute. I also have one other index that is simple visitor_id, ts, and that one doesn't show any issues I assume because visitor_id has such a high cardinality. EXPLAIN SELECT shows I'm hitting the index and it seems to do as well as it could do.
```
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table range visitor_id
Solution
To answer your second question:
MySQL does not have a parallel query execution engine, so even if you partition the query, you are still single threaded. This will eventually kill your scale.
However, you could partition the table by
The output of these parallel queries (which you could store in a temp table as they run) is trivially combinable into the final result by simply adding the distinct counts together.
This is very similar to sharding, but instead of doing it across machines, you are doing it on the same table. By picking a good hash function to generate visitor_id (for example, a modulo or bit reversal if the original id is generated with a AUTO_INCREMENT) you can ensure that all partitions are approximately equal sized.
The reason you want to partition by
And this one in parallel:
Because you know the
You would of course need to pick the partition boundaries in such a way that partitions have approximately the same size.
I will let ypercube file the answer to the indexing question as this is the one that deserves the reward.
MySQL does not have a parallel query execution engine, so even if you partition the query, you are still single threaded. This will eventually kill your scale.
However, you could partition the table by
visitor_id. This would allow you to run several queries (one per partition) in parallel, all of them form:SELECT COUNT(DISTINCT visitor_id)
FROM table WHERE location_id = #
AND region_id = #
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN and The output of these parallel queries (which you could store in a temp table as they run) is trivially combinable into the final result by simply adding the distinct counts together.
This is very similar to sharding, but instead of doing it across machines, you are doing it on the same table. By picking a good hash function to generate visitor_id (for example, a modulo or bit reversal if the original id is generated with a AUTO_INCREMENT) you can ensure that all partitions are approximately equal sized.
The reason you want to partition by
visitor_id and not one of the other columns is that it makes the DISTINCT additive across partitions. For example, consider a table with two partitions. One holds visitor_id 0-99 in one holds and 100-199. You can now express two queries that can run in parallel:INSERT INTO TempResult(visitor_id)
SELECT COUNT(DISTINCT visitor_id)
FROM table WHERE location_id = #
AND region_id = #
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN 0 and 99And this one in parallel:
INSERT INTO TempResults (visitor_id)
SELECT COUNT(DISTINCT visitor_id)
FROM table WHERE location_id = #
AND region_id = #
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN 100 and 199Because you know the
visitor_id is not overlapping between partitions, the final result is:SELECT SUM(visitor_id) FROM TempResultsYou would of course need to pick the partition boundaries in such a way that partitions have approximately the same size.
I will let ypercube file the answer to the indexing question as this is the one that deserves the reward.
Code Snippets
SELECT COUNT(DISTINCT visitor_id)
FROM table WHERE location_id = #
AND region_id = #
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN <partition_start> and <partition_end>INSERT INTO TempResult(visitor_id)
SELECT COUNT(DISTINCT visitor_id)
FROM table WHERE location_id = #
AND region_id = #
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN 0 and 99INSERT INTO TempResults (visitor_id)
SELECT COUNT(DISTINCT visitor_id)
FROM table WHERE location_id = #
AND region_id = #
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN 100 and 199SELECT SUM(visitor_id) FROM TempResultsContext
StackExchange Database Administrators Q#74865, answer score: 4
Revisions (0)
No revisions yet.