patternsqlMinor
Simplifying/Improving inner join (self joining a table)
Viewed 0 times
tablejoinsimplifyinginnerimprovingjoiningself
Problem
I have table which has the following schema:
And I have constructed a query that joins several answer rows into one, uses some of them, (age in this case) as filters:
The output is following:
I'm wondering if the same query can be accomplished in a more efficient way in terms of query time, or if it can be simplified. I'm currently using sqlite3 as the test database, but if there's a way to simplify the query in for example MySQL, that would also be an acceptable solution.
For reference, the answer table contains these values (where datasetID is 'test'), i.e. the results of the query
```
DatasetID QuestionID PersonID answer
---------- ---------- ---------- ----------
test year 12345 1992
test country 12345 Austria
test age 12345 24
cid name type notnull dflt_value pk
---------- ---------- ----------- ---------- ---------- ----------
0 DatasetID VARCHAR(15) 0 1
1 QuestionID VARCHAR(20) 0 2
2 PersonID INTEGER 0 3
3 answer text 0 0And I have constructed a query that joins several answer rows into one, uses some of them, (age in this case) as filters:
SELECT i.answer AS graduation_year, j.answer AS education, k.answer AS country FROM
(SELECT * FROM answer
WHERE QuestionID = 'year'
AND answer <> ''
AND DatasetID = 'test') i
inner join answer j
on i.PersonID = j.PersonID
AND i.DatasetID = j.DatasetID
AND j.QuestionID = 'education'
inner join answer k
on i.PersonID = k.PersonID
AND i.datasetID = k.datasetID
AND k.QuestionID = 'country'
inner join answer m
on i.PersonID = m.PersonID
AND i.datasetID = m.datasetID
AND m.QuestionID = 'age'
AND CAST(m.answer as NUMERIC) > 25;The output is following:
graduation_year education country
--------------- ---------- ----------
1995 Bachelors Finland
1980 PhD eng. AustriaI'm wondering if the same query can be accomplished in a more efficient way in terms of query time, or if it can be simplified. I'm currently using sqlite3 as the test database, but if there's a way to simplify the query in for example MySQL, that would also be an acceptable solution.
For reference, the answer table contains these values (where datasetID is 'test'), i.e. the results of the query
SELECT * FROM answer WHERE datasetID = 'test';```
DatasetID QuestionID PersonID answer
---------- ---------- ---------- ----------
test year 12345 1992
test country 12345 Austria
test age 12345 24
Solution
When you need to retrieve rather many attributes from an EAV structure, you can try pivoting it with grouping & conditional aggregation as an alternative to multiple self-joins:
That avoids multiple table scans and, as a bonus, makes your query shorter. But how much the former will affect the performance can ultimately be determined only by testing.
Still, a minor (or not – depends on the total number of attributes) improvement to the above could be to reduce the underlying dataset to just the attributes you are actually retrieving, like this:
Logically, the above is still not completely equivalent to the series of inner joins because the latter would give you only entries that have all specified attributes. The method in this answer can return entries that do not have some of the attributes requested. If you would like to exclude those, add a HAVING filter that makes sure the number of attributes matches the WHERE filter:
Your query also has a filter on an attribute that in this case is an aggregated column. Such filters would need to be implemented in the HAVING clause. For your example, the filter on age would look like this:
Again, you will need to test the final query in your environment to see how much, if at all, these additional modifications affect the performance.
SELECT
PersonID,
MAX(CASE QuestionID WHEN 'year' THEN answer END) AS year,
MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
MAX(CASE QuestionID WHEN 'country' THEN answer END) AS country,
MAX(CASE QuestionID WHEN 'age' THEN answer END) AS age
FROM
answer
WHERE
DatasetID = 'test'
GROUP BY
PersonID
;That avoids multiple table scans and, as a bonus, makes your query shorter. But how much the former will affect the performance can ultimately be determined only by testing.
Still, a minor (or not – depends on the total number of attributes) improvement to the above could be to reduce the underlying dataset to just the attributes you are actually retrieving, like this:
SELECT
PersonID,
MAX(CASE QuestionID WHEN 'year' THEN answer END) AS year,
MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
MAX(CASE QuestionID WHEN 'country' THEN answer END) AS country,
MAX(CASE QuestionID WHEN 'age' THEN answer END) AS age
FROM
answer
WHERE
DatasetID = 'test'
AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
PersonID
;Logically, the above is still not completely equivalent to the series of inner joins because the latter would give you only entries that have all specified attributes. The method in this answer can return entries that do not have some of the attributes requested. If you would like to exclude those, add a HAVING filter that makes sure the number of attributes matches the WHERE filter:
SELECT
PersonID,
MAX(CASE QuestionID WHEN 'year' THEN answer END) AS year,
MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
MAX(CASE QuestionID WHEN 'country' THEN answer END) AS country,
MAX(CASE QuestionID WHEN 'age' THEN answer END) AS age
FROM
answer
WHERE
DatasetID = 'test'
AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
PersonID
HAVING
COUNT(*) = 4
;Your query also has a filter on an attribute that in this case is an aggregated column. Such filters would need to be implemented in the HAVING clause. For your example, the filter on age would look like this:
SELECT
PersonID,
MAX(CASE QuestionID WHEN 'year' THEN answer END) AS year,
MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
MAX(CASE QuestionID WHEN 'country' THEN answer END) AS country,
MAX(CASE QuestionID WHEN 'age' THEN answer END) AS age
FROM
answer
WHERE
DatasetID = 'test'
AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
PersonID
HAVING
COUNT(*) = 4
AND CAST(MAX(CASE questionID WHEN 'age' THEN answer END) AS numeric) > 25
;Again, you will need to test the final query in your environment to see how much, if at all, these additional modifications affect the performance.
Code Snippets
SELECT
PersonID,
MAX(CASE QuestionID WHEN 'year' THEN answer END) AS year,
MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
MAX(CASE QuestionID WHEN 'country' THEN answer END) AS country,
MAX(CASE QuestionID WHEN 'age' THEN answer END) AS age
FROM
answer
WHERE
DatasetID = 'test'
GROUP BY
PersonID
;SELECT
PersonID,
MAX(CASE QuestionID WHEN 'year' THEN answer END) AS year,
MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
MAX(CASE QuestionID WHEN 'country' THEN answer END) AS country,
MAX(CASE QuestionID WHEN 'age' THEN answer END) AS age
FROM
answer
WHERE
DatasetID = 'test'
AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
PersonID
;SELECT
PersonID,
MAX(CASE QuestionID WHEN 'year' THEN answer END) AS year,
MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
MAX(CASE QuestionID WHEN 'country' THEN answer END) AS country,
MAX(CASE QuestionID WHEN 'age' THEN answer END) AS age
FROM
answer
WHERE
DatasetID = 'test'
AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
PersonID
HAVING
COUNT(*) = 4
;SELECT
PersonID,
MAX(CASE QuestionID WHEN 'year' THEN answer END) AS year,
MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
MAX(CASE QuestionID WHEN 'country' THEN answer END) AS country,
MAX(CASE QuestionID WHEN 'age' THEN answer END) AS age
FROM
answer
WHERE
DatasetID = 'test'
AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
PersonID
HAVING
COUNT(*) = 4
AND CAST(MAX(CASE questionID WHEN 'age' THEN answer END) AS numeric) > 25
;Context
StackExchange Database Administrators Q#155453, answer score: 4
Revisions (0)
No revisions yet.