patternsqlModerate
Complex query to count votes with a redistribution system
Viewed 0 times
redistributionwithquerysystemcountcomplexvotes
Problem
I've spent the last couple of evenings trying to build a system for redistribution of votes in a ranked voting system. I've finally come up with the model for redistributing surpluses from a candidate whose number of votes exceeds the threshold.
I'm at the beginning of my learning curve and I've used Stack Overflow extensively during this process to gain knowledge about how to produce each and every query within this system.
Finally I've reached the result I want as can be seen below on the expense of a really confusing set of queries:
As mentioned - these queries are horrible. Absolutely and utterly HORRIBLE. Now I need to understand how this could be simplified and at the same time as effective or even more effective than this mess of queries.
Could anyone explain how to optimize this in a way that I can learn more about writing good queries?
This is my (awful) query:
```
SELECT vote_candidate candidate, original_votes votes, surplus_redistribution redistribution, (original_votes + surplus_redistribution) votes_after_redistribution
FROM (
SELECT c.vote_candidate, (
SELECT (
(MAX(votes_above_the_threshold) - (
SELECT FLOOR((COUNT(*) / (2 + 1)) + 1) threshold
FROM votes
)) / MAX(votes_above_the_threshold)
I'm at the beginning of my learning curve and I've used Stack Overflow extensively during this process to gain knowledge about how to produce each and every query within this system.
Finally I've reached the result I want as can be seen below on the expense of a really confusing set of queries:
+-----------+-------+----------------+----------------------------+
| CANDIDATE | VOTES | REDISTRIBUTION | VOTES_AFTER_REDISTRIBUTION |
+-----------+-------+----------------+----------------------------+
| 1 | 8 | -1 | 7 |
| 2 | 1 | 0.125 | 1.125 |
| 3 | 2 | 0.25 | 2.25 |
| 4 | 4 | 0.5 | 4.5 |
| 5 | 2 | 0.25 | 2.25 |
| 6 | 3 | 0 | 3 |
+-----------+-------+----------------+----------------------------+As mentioned - these queries are horrible. Absolutely and utterly HORRIBLE. Now I need to understand how this could be simplified and at the same time as effective or even more effective than this mess of queries.
Could anyone explain how to optimize this in a way that I can learn more about writing good queries?
This is my (awful) query:
```
SELECT vote_candidate candidate, original_votes votes, surplus_redistribution redistribution, (original_votes + surplus_redistribution) votes_after_redistribution
FROM (
SELECT c.vote_candidate, (
SELECT (
(MAX(votes_above_the_threshold) - (
SELECT FLOOR((COUNT(*) / (2 + 1)) + 1) threshold
FROM votes
)) / MAX(votes_above_the_threshold)
Solution
Your query is simultaneously awful and awesome. It is truly impressive, especially if you are self-taught.
Unfortunately, I don't believe it is correct. (Or, I may have misunderstood how you intended for the vote redistribution scheme to work.) For example, it reports that the redistribution in favour of Candidate 2 should be +0.125. Of the 8 ballots that have Candidate 1 as the first choice, 4 of them have Candidate 2 as the second choice. Shouldn't the redistribution be +0.5 then?
As further evidence, the
Schema
I would have changed the terminology of your
However, it may have been better to do away with that table altogether, and go straight to the
I would have used the name "votes" for what you called the
For the rest of the review, I'll stick with your original names, so as to avoid confusion.
Choice of RDBMS
For a complex query like this, I recommend using a more capable database than MySQL. For example, PostgreSQL (and others) offers several advantages:
-
Support for Common Table Expressions (
I've written my solution below using CTEs, but it's trivial to rewrite each CTE as a
-
Support for recursive queries. Using recursive CTEs, you should be able to write one query that processes multiple rounds of vote transfers. With MySQL, your alternative would be to write a stored procedure that manipulates a temporary table, which is less elegant.
-
Stricter interpretation of SQL. I often find that MySQL adheres loosely to the SQL standard, and makes sense of queries that shouldn't make sense. For example, you have some rather suspect SQL such as
that PostgreSQL would have rejected (see discussion about this excerpt below). You'll end up acquiring better habits developing on PostgreSQL than MySQL.
Overall impression
Unfortunately, your query is too complex for me to try to understand, especially since it yields wrong results. However, I can say that you did some things well:
Some problems are:
-
Convoluted reasoning that looks almost circular. For example, in the first two dozen lines…
Unfortunately, I don't believe it is correct. (Or, I may have misunderstood how you intended for the vote redistribution scheme to work.) For example, it reports that the redistribution in favour of Candidate 2 should be +0.125. Of the 8 ballots that have Candidate 1 as the first choice, 4 of them have Candidate 2 as the second choice. Shouldn't the redistribution be +0.5 then?
As further evidence, the
redistribution column should sum to 0. However, -1 + 0.125 + 0.25 + 0.5 + 0.25 + 0 = 0.125, which isn't neutral.Schema
I would have changed the terminology of your
votes table asCREATE TABLE ballots
( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
, choice1 INT
, choice2 INT
, choice3 INT
, choice4 INT
, choice5 INT
, choice6 INT
);However, it may have been better to do away with that table altogether, and go straight to the
vote_orders table, because it's a bad idea to design your schema to support a fixed number of candidates. (As the saying goes, there are only three numbers in computer science: zero, one, and many. You want to support many.) Consider the ballots table to be more of a data entry tool (possibly a TEMPORARY TABLE) than a storage format.I would have used the name "votes" for what you called the
vote_orders table:CREATE TABLE votes
( ballot INT NOT NULL
, rank INT NOT NULL
, candidate INT NOT NULL
, UNIQUE (ballot, rank)
, UNIQUE (ballot, candidate)
);For the rest of the review, I'll stick with your original names, so as to avoid confusion.
Choice of RDBMS
For a complex query like this, I recommend using a more capable database than MySQL. For example, PostgreSQL (and others) offers several advantages:
-
Support for Common Table Expressions (
WITH clauses), which help tremendously with managing the complexity. MySQL does not support CTEs. As a workaround, you can create views instead. However, it's harder to edit all of your views at once as easily as a single query with CTEs, so development is more cumbersome. Also, CTEs don't pollute the global namespace, and it's easier to see how subsequent CTEs depend on earlier CTEs.I've written my solution below using CTEs, but it's trivial to rewrite each CTE as a
CREATE VIEW statement instead, and I've verified that it does work with MySQL that way.-
Support for recursive queries. Using recursive CTEs, you should be able to write one query that processes multiple rounds of vote transfers. With MySQL, your alternative would be to write a stored procedure that manipulates a temporary table, which is less elegant.
-
Stricter interpretation of SQL. I often find that MySQL adheres loosely to the SQL standard, and makes sense of queries that shouldn't make sense. For example, you have some rather suspect SQL such as
WHERE votes_above_the_threshold = (
SELECT MAX(votes_above_the_threshold)
FROM vote_orders
)that PostgreSQL would have rejected (see discussion about this excerpt below). You'll end up acquiring better habits developing on PostgreSQL than MySQL.
Overall impression
Unfortunately, your query is too complex for me to try to understand, especially since it yields wrong results. However, I can say that you did some things well:
- Indentation is consistent and logical (though using subqueries to select attributes results in messy indentation; see below).
- Keywords are capitalized according to common practice.
- Column names are pretty clear and meaningful.
Some problems are:
- Repetition, such as the threshold calculation.
- Cryptic table aliases, such as
t,c, andy.
- Many of your subqueries are to select an attribute (
SELECT (SELECT …) AS attribute). Try to formulate your subselects to appear in theFROMclause (SELECT subsel.attr FROM (SELECT attr FROM …)), which will make your indentation saner and the query generally more readable.
-
Convoluted reasoning that looks almost circular. For example, in the first two dozen lines…
SELECT c.vote_candidate, (
SELECT (
(MAX(votes_above_the_threshold) - (
SELECT FLOOR((COUNT(*) / (2 + 1)) + 1) threshold
FROM votes
)) / MAX(votes_above_the_threshold)
) ratio
FROM (
SELECT vote_candidate vote_candidate, COUNT(*) votes_above_the_threshold
FROM vote_orders
WHERE vote_order = 1
GROUP BY vote_candidate
HAVING votes_above_the_threshold >= (
SELECT FLOOR((COUNT(*) / (2 + 1)) + 1) threshold
FROM votes
)
) t
WHERE votes_above_the_threshold = (
SELECT MAX(votes_above_the_threshold)
FROM vote_orders
)
) surplus_ratio,votes_above_the_threshold is defined as a column of t. t is filtered through a HAVING clause. It is further filtered by WHERE votes_about_the_threshold = (SELECT MAX(votes_above_the_threshold) FROM vote_orders), bCode Snippets
CREATE TABLE ballots
( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
, choice1 INT
, choice2 INT
, choice3 INT
, choice4 INT
, choice5 INT
, choice6 INT
);CREATE TABLE votes
( ballot INT NOT NULL
, rank INT NOT NULL
, candidate INT NOT NULL
, UNIQUE (ballot, rank)
, UNIQUE (ballot, candidate)
);WHERE votes_above_the_threshold = (
SELECT MAX(votes_above_the_threshold)
FROM vote_orders
)SELECT c.vote_candidate, (
SELECT (
(MAX(votes_above_the_threshold) - (
SELECT FLOOR((COUNT(*) / (2 + 1)) + 1) threshold
FROM votes
)) / MAX(votes_above_the_threshold)
) ratio
FROM (
SELECT vote_candidate vote_candidate, COUNT(*) votes_above_the_threshold
FROM vote_orders
WHERE vote_order = 1
GROUP BY vote_candidate
HAVING votes_above_the_threshold >= (
SELECT FLOOR((COUNT(*) / (2 + 1)) + 1) threshold
FROM votes
)
) t
WHERE votes_above_the_threshold = (
SELECT MAX(votes_above_the_threshold)
FROM vote_orders
)
) surplus_ratio,WITH quota AS (
SELECT FLOOR(COUNT(*) / (2 + 1)) + 1 AS threshold
FROM votes
), preferences AS (
SELECT 1 AS round
, NULL AS preferred_candidate
, vote_candidate AS candidate
, COUNT(*) AS votes
, CAST(NULL AS INTEGER) AS elected_in_round
FROM vote_orders
WHERE vote_order = 1
GROUP BY vote_candidate
UNION ALL
SELECT fallback.vote_order AS round
, preferred.vote_candidate AS preferred_candidate
, fallback.vote_candidate AS candidate
, COUNT(*) AS votes
, NULL AS elected_in_round
FROM vote_orders AS preferred
JOIN vote_orders AS fallback
ON preferred.vote_id = fallback.vote_id
AND preferred.vote_order + 1 = fallback.vote_order
GROUP BY fallback.vote_order, preferred.vote_candidate, fallback.vote_candidate
), electees AS (
SELECT preferences.round
, preferences.preferred_candidate
, preferences.candidate
, preferences.votes
, round AS elected_in_round
, votes - quota.threshold AS surplus
FROM preferences, quota
WHERE votes >= quota.threshold
AND round = (SELECT MIN(round) FROM preferences)
), per_electee_redistribution AS (
SELECT this_round.round AS round
, this_round.candidate AS preferred_candidate
, next_round.candidate AS candidate
, this_round.votes AS preferred_votes
, this_round.surplus AS surplus
, next_round.votes AS fallback_votes
, this_round.surplus * next_round.votes / (this_round.votes + 0.0) AS redistribution
FROM electees AS this_round
LEFT OUTER JOIN preferences AS next_round
ON this_round.round + 1 = next_round.round
AND this_round.candidate = next_round.preferred_candidate
WHERE this_round.elected_in_round = this_round.round
), redistribution AS (
SELECT round
, candidate AS beneficiary
, SUM(redistribution) AS redistribution
FROM per_electee_redistribution
GROUP BY round, candidate
UNION ALL
SELECT round
, candidate AS beneficiary
, -surplus
FROM electees
WHERE elected_in_round = round
), redistributed AS (
SELECT preferences.round AS round
, preferences.preferred_candidate
, preferences.candidate
, preferences.votes
, COALESCE(redistribution.redistribution, 0) AS redistribution
, preferences.votes + COALESCE(redistribution.redistribution, 0) AS votes_after_redistribution
FROM preferences
LEFT OUTER JOIN redistribution
ON preferences.round = redistribution.round
AND preferences.candidate = redistribution.beneficiary
), raw_next_round AS (
SELECT round
, NULL AS preferred_candidate
, candidate
, SUM(votes) AS votes
, NULL AS elected_in_round
Context
StackExchange Code Review Q#54099, answer score: 19
Revisions (0)
No revisions yet.