HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to optimize a very slow query with joins and group by?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
howgroupwithqueryslowoptimizeveryandjoins

Problem

I have this GTFS database :

All the ***_id columns are indexed, and there are foreign keys between tables (such as a FK on trips.route_id to routes.route_id).

So basically the link between a stop and a route is routestripsstop_timesstops.

I have created and filled a table to mention all stops that are physically close to each other, filling stop_connections. Pretty simple.

Now what I want is to get from route A all other routes that can be reached by one of the stops of this route.

So basically I want to go like:

A.route → A.trips (many) → A.trips.stop_times(many^2) → A.trips.stop_times.stops(many^3)


then use stop_connections to get related stops, and then go backwards to get the routes from these stops:

connected_stops → connected_stops.stop_times → connected_stops.stop_times.trips → connected_stops.stop_times.trips.routes


I would get a table similar to this:

+---------------+-------------+
| from_route_id | to_route_id |
+---------------+-------------+
|          0001 |        0002 |
|          0001 |        0004 |
|          0001 |        0005 |
|          0002 |        0001 |
|          0002 |        0005 |
+---------------+-------------+


This is my query:

```
-- get from/to route IDs
select
r.route_id as from_route_id,
c_t.route_id as to_route_id

  • start from route A


from routes r

-- going down on all trips on route A
left join trips t on t.route_id = r.route_id

-- going down on all stop_times for A's trips
left join stop_times st on st.trip_id = t.trip_id

-- going down on all stops use by A's trips/vehicles
left join stops s on s.stop_id = st.stop_id

-- get connected stops
inner join stop_connections c_s on c_s.from_stop_id = s.stop_id

-- going up to stop_times
inner join stop_times c_st on c_st.stop_id = c_s.to_stop_id

-- going up to trips
inner join trips c_t on c_t.trip_id = c_st.trip_id and c_t.route_id <> r.route_id

-- no need to actually go up to routes because we already have the r

Solution

OK, I ended up adding another lookup table:

CREATE TABLE IF NOT EXISTS `stops_routes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `stop_id` varchar(100) NOT NULL,
  `route_id` varchar(100) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `stop_route` (`stop_id`,`route_id`),
  KEY `stop_id` (`stop_id`),
  KEY `route_id` (`route_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;


Filling it was fairly fast:

mysql> insert into stops_routes (stop_id, route_id)
    ->
    -> select
    ->     s.stop_id,
    ->     r.route_id as from_route_id
    ->
    -> from routes r
    -> left join trips t on t.route_id = r.route_id
    -> left join stop_times st on st.trip_id = t.trip_id
    -> left join stops s on s.stop_id = st.stop_id
    -> group by s.stop_id, r.route_id;
Query OK, 3496 rows affected (8.38 sec)
Records: 3496  Duplicates: 0  Warnings: 0


Using it is blazingly fast:

mysql> select
    ->     r.route_id as from_route_id,
    ->     c_sr.route_id as to_route_id
    ->
    -> from routes r
    ->
    -> left join stops_routes sr on sr.route_id = r.route_id
    -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
    -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
    ->
    -> where r.route_id <> c_sr.route_id
    -> group by r.route_id, c_sr.route_id
    -> limit 10;
+---------------+-------------+
| from_route_id | to_route_id |
+---------------+-------------+
| 0001          | 0002        |
| 0001          | 0003        |
| 0001          | 0004        |
| 0001          | 0005        |
| 0001          | 0006        |
| 0001          | 0008        |
| 0001          | 0009        |
| 0001          | 0011        |
| 0001          | 0014        |
| 0001          | 0031        |
+---------------+-------------+
10 rows in set (0.63 sec)


Now I can fill my last lookup table (set of connections between every routes on my GTFS network):

mysql> insert into route_connections (from_route_id, to_route_id)
    -> select
    ->     r.route_id as from_route_id,
    ->     c_sr.route_id as to_route_id
    ->
    -> from routes r
    ->
    -> left join stops_routes sr on sr.route_id = r.route_id
    -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
    -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
    ->
    -> where r.route_id <> c_sr.route_id
    -> group by r.route_id, c_sr.route_id;
Query OK, 2848 rows affected (0.31 sec)
Records: 2848  Duplicates: 0  Warnings: 0


Amazingly fast. I guess the engine couldn't break up the steps to optimize this.

I'd still be interested to know if it would be possible to get the same result (from route to route connections table) using only one sub-second or sub-minute query.

Code Snippets

CREATE TABLE IF NOT EXISTS `stops_routes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `stop_id` varchar(100) NOT NULL,
  `route_id` varchar(100) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `stop_route` (`stop_id`,`route_id`),
  KEY `stop_id` (`stop_id`),
  KEY `route_id` (`route_id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;
mysql> insert into stops_routes (stop_id, route_id)
    ->
    -> select
    ->     s.stop_id,
    ->     r.route_id as from_route_id
    ->
    -> from routes r
    -> left join trips t on t.route_id = r.route_id
    -> left join stop_times st on st.trip_id = t.trip_id
    -> left join stops s on s.stop_id = st.stop_id
    -> group by s.stop_id, r.route_id;
Query OK, 3496 rows affected (8.38 sec)
Records: 3496  Duplicates: 0  Warnings: 0
mysql> select
    ->     r.route_id as from_route_id,
    ->     c_sr.route_id as to_route_id
    ->
    -> from routes r
    ->
    -> left join stops_routes sr on sr.route_id = r.route_id
    -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
    -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
    ->
    -> where r.route_id <> c_sr.route_id
    -> group by r.route_id, c_sr.route_id
    -> limit 10;
+---------------+-------------+
| from_route_id | to_route_id |
+---------------+-------------+
| 0001          | 0002        |
| 0001          | 0003        |
| 0001          | 0004        |
| 0001          | 0005        |
| 0001          | 0006        |
| 0001          | 0008        |
| 0001          | 0009        |
| 0001          | 0011        |
| 0001          | 0014        |
| 0001          | 0031        |
+---------------+-------------+
10 rows in set (0.63 sec)
mysql> insert into route_connections (from_route_id, to_route_id)
    -> select
    ->     r.route_id as from_route_id,
    ->     c_sr.route_id as to_route_id
    ->
    -> from routes r
    ->
    -> left join stops_routes sr on sr.route_id = r.route_id
    -> left join stop_connections c_s on c_s.from_stop_id = sr.stop_id
    -> left join stops_routes c_sr on c_sr.stop_id = c_s.to_stop_id
    ->
    -> where r.route_id <> c_sr.route_id
    -> group by r.route_id, c_sr.route_id;
Query OK, 2848 rows affected (0.31 sec)
Records: 2848  Duplicates: 0  Warnings: 0

Context

StackExchange Database Administrators Q#77597, answer score: 2

Revisions (0)

No revisions yet.