patternsqlMinor
Get incremental counts of an aggregated value in a joined table
Viewed 0 times
countsjoinedvaluegetaggregatedincrementaltable
Problem
I have two tables in a MySQL 5.7.22 database:
For each increment of 10 points of weight (i.e. for 0, 10, 20, 30, etc), I want to get a count of posts that have a total weight less than or equal to that increment. I'd expect the results for that to look something like this:
The total weights are approximately normally distributed, with a few very low values and a few very high values (maximum is currently 1277), but the majority in the middle. There are just under 120,000 rows in
The relevant parts of the tables look like this:
So far, I've tried dropping the post ID and total weight into a view, then joining that view to itself to get an aggregated count:
``
SELECT
posts.id,
SUM(reasons.weight) AS reason_weight
FROM posts
INNER JOIN posts_reasons ON posts.id = posts_reasons.post_id
INNER JOIN reasons ON posts_reasons.reason_id = reasons.id
GROUP BY posts.id
);
SELECT
FLOOR(p1.reason_weight / 10) AS weight,
COUNT(DISTINCT p2.id) AS cumulative
FROM post_weights AS p1
INNER JOIN post_weights AS p2 ON FLOOR(p2.reason_weight / 10) <= FLOOR(p1.reason_
posts and reasons. Each post row has and belongs to many reason rows. Each reason has a weight associated with it, and each post therefore has a total aggregated weight associated with it.For each increment of 10 points of weight (i.e. for 0, 10, 20, 30, etc), I want to get a count of posts that have a total weight less than or equal to that increment. I'd expect the results for that to look something like this:
weight | post_count
--------+------------
0 | 0
10 | 5
20 | 12
30 | 18
... | ...
280 | 20918
290 | 21102
... | ...
1250 | 118005
1260 | 118039
1270 | 118040The total weights are approximately normally distributed, with a few very low values and a few very high values (maximum is currently 1277), but the majority in the middle. There are just under 120,000 rows in
posts, and around 120 in reasons. Each post has on average 5 or 6 reasons.The relevant parts of the tables look like this:
CREATE TABLE `posts` (
id BIGINT PRIMARY KEY
);
CREATE TABLE `reasons` (
id BIGINT PRIMARY KEY,
weight INT(11) NOT NULL
);
CREATE TABLE `posts_reasons` (
post_id BIGINT NOT NULL,
reason_id BIGINT NOT NULL,
CONSTRAINT fk_posts_reasons_posts (post_id) REFERENCES posts(id),
CONSTRAINT fk_posts_reasons_reasons (reason_id) REFERENCES reasons(id)
);So far, I've tried dropping the post ID and total weight into a view, then joining that view to itself to get an aggregated count:
``
CREATE VIEW post_weights` AS (SELECT
posts.id,
SUM(reasons.weight) AS reason_weight
FROM posts
INNER JOIN posts_reasons ON posts.id = posts_reasons.post_id
INNER JOIN reasons ON posts_reasons.reason_id = reasons.id
GROUP BY posts.id
);
SELECT
FLOOR(p1.reason_weight / 10) AS weight,
COUNT(DISTINCT p2.id) AS cumulative
FROM post_weights AS p1
INNER JOIN post_weights AS p2 ON FLOOR(p2.reason_weight / 10) <= FLOOR(p1.reason_
Solution
Using functions or expressions in JOIN conditions is usually a bad idea, I say usually because some optimisers can handle it fairly well and utilize indexes anyhow. I would suggest creating a table for the weights. Something like:
Make sure you have indexes on
A query like:
My machine at home is probably 5-6 years old, it has an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz and 8Gb of ram.
uname -a
Linux dustbite 4.16.6-302.fc28.x86_64 #1 SMP Wed May 2 00:07:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I tested against:
https://drive.google.com/open?id=1q3HZXW_qIZ01gU-Krms7qMJW3GCsOUP5
If performance is critical and nothing else helps you could create a summary table for:
You can maintain this table via triggers
Since there is a certain amount of work that needs to be done for each weight in weights, it may be beneficial to limit this table.
Since I had a lot of unnecesary rows in my weights table (max 2590), the restriction above cut the execution time from 9 down to 4 seconds.
CREATE TABLE weights
( weight int not null primary key
);
INSERT INTO weights (weight) VALUES (0),(10),(20),...(1270);Make sure you have indexes on
posts_reasons:CREATE UNIQUE INDEX ... ON posts_reasons (reason_id, post_id);A query like:
SELECT w.weight
, COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight
FROM reasons r
JOIN posts_reasons pr
ON r.id = pr.reason_id
GROUP BY pr.post_id
) as x
ON w.weight > x.sum_weight
GROUP BY w.weight;My machine at home is probably 5-6 years old, it has an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz and 8Gb of ram.
uname -a
Linux dustbite 4.16.6-302.fc28.x86_64 #1 SMP Wed May 2 00:07:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I tested against:
https://drive.google.com/open?id=1q3HZXW_qIZ01gU-Krms7qMJW3GCsOUP5
MariaDB [test3]> select @@version;
+-----------------+
| @@version |
+-----------------+
| 10.2.14-MariaDB |
+-----------------+
1 row in set (0.00 sec)
SELECT w.weight
, COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight
FROM reasons r
JOIN posts_reasons pr
ON r.id = pr.reason_id
GROUP BY pr.post_id
) as x
ON w.weight > x.sum_weight
GROUP BY w.weight;
+--------+------------+
| weight | post_count |
+--------+------------+
| 0 | 1 |
| 10 | 2591 |
| 20 | 4264 |
| 30 | 4386 |
| 40 | 5415 |
| 50 | 7499 |
[...]
| 1270 | 119283 |
| 1320 | 119286 |
| 1330 | 119286 |
[...]
| 2590 | 119286 |
+--------+------------+
256 rows in set (9.89 sec)If performance is critical and nothing else helps you could create a summary table for:
SELECT pr.post_id, SUM(r.weight) as sum_weight
FROM reasons r
JOIN posts_reasons pr
ON r.id = pr.reason_id
GROUP BY pr.post_idYou can maintain this table via triggers
Since there is a certain amount of work that needs to be done for each weight in weights, it may be beneficial to limit this table.
ON w.weight > x.sum_weight
WHERE w.weight <= (select MAX(sum_weights)
from (SELECT SUM(weight) as sum_weights
FROM reasons r
JOIN posts_reasons pr
ON r.id = pr.reason_id
GROUP BY pr.post_id) a
)
GROUP BY w.weightSince I had a lot of unnecesary rows in my weights table (max 2590), the restriction above cut the execution time from 9 down to 4 seconds.
Code Snippets
CREATE TABLE weights
( weight int not null primary key
);
INSERT INTO weights (weight) VALUES (0),(10),(20),...(1270);CREATE UNIQUE INDEX ... ON posts_reasons (reason_id, post_id);SELECT w.weight
, COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight
FROM reasons r
JOIN posts_reasons pr
ON r.id = pr.reason_id
GROUP BY pr.post_id
) as x
ON w.weight > x.sum_weight
GROUP BY w.weight;MariaDB [test3]> select @@version;
+-----------------+
| @@version |
+-----------------+
| 10.2.14-MariaDB |
+-----------------+
1 row in set (0.00 sec)
SELECT w.weight
, COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight
FROM reasons r
JOIN posts_reasons pr
ON r.id = pr.reason_id
GROUP BY pr.post_id
) as x
ON w.weight > x.sum_weight
GROUP BY w.weight;
+--------+------------+
| weight | post_count |
+--------+------------+
| 0 | 1 |
| 10 | 2591 |
| 20 | 4264 |
| 30 | 4386 |
| 40 | 5415 |
| 50 | 7499 |
[...]
| 1270 | 119283 |
| 1320 | 119286 |
| 1330 | 119286 |
[...]
| 2590 | 119286 |
+--------+------------+
256 rows in set (9.89 sec)SELECT pr.post_id, SUM(r.weight) as sum_weight
FROM reasons r
JOIN posts_reasons pr
ON r.id = pr.reason_id
GROUP BY pr.post_idContext
StackExchange Database Administrators Q#206934, answer score: 8
Revisions (0)
No revisions yet.