HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to setup complex multi-column index for massive table

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
multimassivecolumnsetupforhowindextablecomplex

Problem

I've searched what I could search on Google, SO, etc and haven't found an answer that would seem to fit what I'm looking for. I'm hesitant to make a post about this as I'm sure the answer is out there somewhere, I just haven't been able to find it :-\

I have setup a table in MySQL that I really could use some insight on. I have an existing index and it works pretty well in some cases, and others it takes 20+ seconds to run. Let me give you some background.

It's a MyISAM table with fixed rows (every column is an INT(11) ) and this table currently has 150,000,000 rows on it (10.2 GB total). I use this to track analytics run on the web software we use as other open source alternatives (like Piwik) were simply overkill and we needed direct access to the "stuff". That aside here's the basic structure of the table. (again all INT(11) as I read and understood that indexes with the same type and length are most effective).

id    region_id location_id action_id visitor_id ts        url_id employee_id


And here's the following range of values for each column.

150M  <20       <3000      <10       <40M        unix_time <20k   <200k


The query that I am trying to run is basically to get a distinct count of all the visitors that did a certain action for a location during a particular time. (in other words, must have location, action, and ts in the query)

SELECT COUNT(DISTINCT visitor_id) FROM table WHERE location_id = # AND region_id = # AND action_id = # AND ts BETWEEN x AND y


I have a multi-column index on location, region, action, ts and this query can take 20 to 30 seconds to execute. I also have one other index that is simple visitor_id, ts, and that one doesn't show any issues I assume because visitor_id has such a high cardinality. EXPLAIN SELECT shows I'm hitting the index and it seems to do as well as it could do.

```
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table range visitor_id

Solution

To answer your second question:

MySQL does not have a parallel query execution engine, so even if you partition the query, you are still single threaded. This will eventually kill your scale.

However, you could partition the table by visitor_id. This would allow you to run several queries (one per partition) in parallel, all of them form:

SELECT COUNT(DISTINCT visitor_id) 
FROM table WHERE location_id = # 
AND region_id = # 
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN  and 


The output of these parallel queries (which you could store in a temp table as they run) is trivially combinable into the final result by simply adding the distinct counts together.

This is very similar to sharding, but instead of doing it across machines, you are doing it on the same table. By picking a good hash function to generate visitor_id (for example, a modulo or bit reversal if the original id is generated with a AUTO_INCREMENT) you can ensure that all partitions are approximately equal sized.

The reason you want to partition by visitor_id and not one of the other columns is that it makes the DISTINCT additive across partitions. For example, consider a table with two partitions. One holds visitor_id 0-99 in one holds and 100-199. You can now express two queries that can run in parallel:

INSERT INTO TempResult(visitor_id)
SELECT COUNT(DISTINCT visitor_id) 
    FROM table WHERE location_id = # 
    AND region_id = # 
    AND action_id = # AND ts BETWEEN x AND y
    AND visitor_id BETWEEN 0 and 99


And this one in parallel:

INSERT INTO TempResults (visitor_id)
SELECT COUNT(DISTINCT visitor_id) 
    FROM table WHERE location_id = # 
    AND region_id = # 
    AND action_id = # AND ts BETWEEN x AND y
    AND visitor_id BETWEEN 100 and 199


Because you know the visitor_id is not overlapping between partitions, the final result is:

SELECT SUM(visitor_id) FROM TempResults


You would of course need to pick the partition boundaries in such a way that partitions have approximately the same size.

I will let ypercube file the answer to the indexing question as this is the one that deserves the reward.

Code Snippets

SELECT COUNT(DISTINCT visitor_id) 
FROM table WHERE location_id = # 
AND region_id = # 
AND action_id = # AND ts BETWEEN x AND y
AND visitor_id BETWEEN <partition_start> and <partition_end>
INSERT INTO TempResult(visitor_id)
SELECT COUNT(DISTINCT visitor_id) 
    FROM table WHERE location_id = # 
    AND region_id = # 
    AND action_id = # AND ts BETWEEN x AND y
    AND visitor_id BETWEEN 0 and 99
INSERT INTO TempResults (visitor_id)
SELECT COUNT(DISTINCT visitor_id) 
    FROM table WHERE location_id = # 
    AND region_id = # 
    AND action_id = # AND ts BETWEEN x AND y
    AND visitor_id BETWEEN 100 and 199
SELECT SUM(visitor_id) FROM TempResults

Context

StackExchange Database Administrators Q#74865, answer score: 4

Revisions (0)

No revisions yet.