HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How to optimize multiple self-JOINs?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
optimizemultiplehowjoinsself

Problem

I'm looking for advice on either optimizing multiple self-joins, or a better table/DB design.

One of the tables looks as follows (relevant cols only):

CREATE TABLE IF NOT EXISTS CountryData (
  countryDataID INT PRIMARY KEY AUTO_INCREMENT,
  dataID INT NOT NULL REFERENCES DataSources (dataID),
  dataCode VARCHAR(30) NULL,
  countryID INT NOT NULL REFERENCES Countries (countryID),
  year INT NOT NULL ,
  data DEC(20,4) NULL,
  INDEX countryDataYear (dataID, countryID, year));


The data column has values for a few hundred indicators, 90 countries, and 30 years for ~1mn rows total. A standard query requires selecting N indicators for a particular year and C countries, yielding a CxN table for 90 rows max.

With all values in a single column, self-joins seemed like the way to go. So I have experimented with various suggestions to speed those up, including indexing and creating new (temp) tables. At 9 self-joins, the query takes a little under 1 min. Beyond that, it spins forever.

The new table from where the self-joins take place has only about 1,000 rows, indexed on what seem to be the relevant variables - creation takes about 0.5 sec:

CREATE TABLE Growth
    SELECT dataID, countryID, year, data
    FROM CountryData
    WHERE dataID > 522 AND year = 2017;

CREATE INDEX growth_ix 
    ON Growth (dataID, countryID);


The SELECT query then arranges up to XX indicators in the results table, with XX unfortunately <10:

```
SELECT
Countries.countryName AS Country,
em01.em,
em02.em,
em03.em
...
emX.em
FROM
(SELECT
em1.data AS em,
em1.countryID
FROM Growth AS em1
WHERE
em1.dataID = 523) as em01
JOIN
(SELECT
em2.data AS em,
em2.countryID
FROM Growth AS em2
WHERE
em2.dataID = 524) as em02
USING (countryID)
JOIN
(SELECT
em3.data AS em,
em3.countryID
FROM Growth AS em3
WHERE
em3.dataID = 525) as em03
USING (count

Solution

I don't think there is a need for denormalizing the table. Self-joining will work fine if you have indexes that will be used effectively. For the specific query, I'd add an index on (year, dataID, countryID, data) and not use any derived tables:

SELECT 
    c.countryName AS Country,   
    em01.data AS data01,
    em02.data AS data02,
    ...
    emXX.data AS dataXX
FROM  
    Countries AS c
      JOIN CountryData AS em01
        ON  em01.year = 2017
        AND em01.dataID = 523 
        AND em01.countryID = c.countryID 
      JOIN CountryData AS em02
        ON  em02.year = 2017
        AND em02.dataID = 524
        AND em02.countryID = c.countryID
      ... 
      JOIN CountryData AS emXX
        ON  emXX.year = 2017
        AND emXX.dataID = YYY
        AND emXX.countryID = c.countryID 
 ;


The only issue you'll have is that MySQL has a hard limit of maximum 61 joins in a query. So, you won't be able to have 90 columns with the above query.

Another observation is that you don't seem to need any aggregated data but just a tiny subset of the (small or big table, doesn't matter). With the suggested index, you could just write a query like this:

SELECT 
    cd.countryID,
    c.countryName AS Country,   
    cd.dataID,
    cd.data
FROM  
    Countries AS c
      JOIN CountryData AS cd
        ON  cd.countryID = c.countryID 
WHERE 
      cd.year = 2017
  AND cd.dataID IN (522, 523, ..., YYY) 
ORDER BY
    cd.countryID,
    cd.dataID ;


and have the pivot transformation done in your application.

Code Snippets

SELECT 
    c.countryName AS Country,   
    em01.data AS data01,
    em02.data AS data02,
    ...
    emXX.data AS dataXX
FROM  
    Countries AS c
      JOIN CountryData AS em01
        ON  em01.year = 2017
        AND em01.dataID = 523 
        AND em01.countryID = c.countryID 
      JOIN CountryData AS em02
        ON  em02.year = 2017
        AND em02.dataID = 524
        AND em02.countryID = c.countryID
      ... 
      JOIN CountryData AS emXX
        ON  emXX.year = 2017
        AND emXX.dataID = YYY
        AND emXX.countryID = c.countryID 
 ;
SELECT 
    cd.countryID,
    c.countryName AS Country,   
    cd.dataID,
    cd.data
FROM  
    Countries AS c
      JOIN CountryData AS cd
        ON  cd.countryID = c.countryID 
WHERE 
      cd.year = 2017
  AND cd.dataID IN (522, 523, ..., YYY) 
ORDER BY
    cd.countryID,
    cd.dataID ;

Context

StackExchange Database Administrators Q#43651, answer score: 4

Revisions (0)

No revisions yet.