HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Simplifying/Improving inner join (self joining a table)

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
tablejoinsimplifyinginnerimprovingjoiningself

Problem

I have table which has the following schema:

cid         name        type         notnull     dflt_value  pk
----------  ----------  -----------  ----------  ----------  ----------
0           DatasetID   VARCHAR(15)  0                       1
1           QuestionID  VARCHAR(20)  0                       2
2           PersonID    INTEGER      0                       3
3           answer      text         0                       0


And I have constructed a query that joins several answer rows into one, uses some of them, (age in this case) as filters:

SELECT i.answer AS graduation_year, j.answer AS education, k.answer AS country FROM

(SELECT * FROM answer
WHERE QuestionID = 'year'
AND answer <> ''
AND DatasetID = 'test') i

inner join answer j
    on i.PersonID = j.PersonID
    AND i.DatasetID = j.DatasetID
    AND j.QuestionID = 'education'

inner join answer k
    on i.PersonID = k.PersonID
    AND i.datasetID = k.datasetID
    AND k.QuestionID = 'country'

inner join answer m
    on i.PersonID = m.PersonID
    AND i.datasetID = m.datasetID
    AND m.QuestionID = 'age'
    AND CAST(m.answer as NUMERIC) > 25;


The output is following:

graduation_year  education   country
---------------  ----------  ----------
1995             Bachelors   Finland
1980             PhD eng.    Austria


I'm wondering if the same query can be accomplished in a more efficient way in terms of query time, or if it can be simplified. I'm currently using sqlite3 as the test database, but if there's a way to simplify the query in for example MySQL, that would also be an acceptable solution.

For reference, the answer table contains these values (where datasetID is 'test'), i.e. the results of the query SELECT * FROM answer WHERE datasetID = 'test';

```
DatasetID QuestionID PersonID answer
---------- ---------- ---------- ----------
test year 12345 1992
test country 12345 Austria
test age 12345 24

Solution

When you need to retrieve rather many attributes from an EAV structure, you can try pivoting it with grouping & conditional aggregation as an alternative to multiple self-joins:

SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
GROUP BY
  PersonID
;


That avoids multiple table scans and, as a bonus, makes your query shorter. But how much the former will affect the performance can ultimately be determined only by testing.

Still, a minor (or not – depends on the total number of attributes) improvement to the above could be to reduce the underlying dataset to just the attributes you are actually retrieving, like this:

SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
  AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
  PersonID
;


Logically, the above is still not completely equivalent to the series of inner joins because the latter would give you only entries that have all specified attributes. The method in this answer can return entries that do not have some of the attributes requested. If you would like to exclude those, add a HAVING filter that makes sure the number of attributes matches the WHERE filter:

SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
  AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
  PersonID
HAVING
  COUNT(*) = 4
;


Your query also has a filter on an attribute that in this case is an aggregated column. Such filters would need to be implemented in the HAVING clause. For your example, the filter on age would look like this:

SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
  AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
  PersonID
HAVING
  COUNT(*) = 4
  AND CAST(MAX(CASE questionID WHEN 'age' THEN answer END) AS numeric) > 25
;


Again, you will need to test the final query in your environment to see how much, if at all, these additional modifications affect the performance.

Code Snippets

SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
GROUP BY
  PersonID
;
SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
  AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
  PersonID
;
SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
  AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
  PersonID
HAVING
  COUNT(*) = 4
;
SELECT
  PersonID,
  MAX(CASE QuestionID WHEN 'year'      THEN answer END) AS year,
  MAX(CASE QuestionID WHEN 'education' THEN answer END) AS education,
  MAX(CASE QuestionID WHEN 'country'   THEN answer END) AS country,
  MAX(CASE QuestionID WHEN 'age'       THEN answer END) AS age
FROM
  answer
WHERE
  DatasetID = 'test'
  AND QuestionID IN ('year', 'education', 'country', 'age')
GROUP BY
  PersonID
HAVING
  COUNT(*) = 4
  AND CAST(MAX(CASE questionID WHEN 'age' THEN answer END) AS numeric) > 25
;

Context

StackExchange Database Administrators Q#155453, answer score: 4

Revisions (0)

No revisions yet.