HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Improving performance of nasty nested-view joins

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
viewnestednastyperformanceimprovingjoins

Problem

I have a moderate-sized database spread out over a few tables, the rough architecture is:

  • Input Data (Data ID, Session ID and a few fields of statistical importance)



  • Input File (Data ID and a blob)



  • Stage 1 Output File (Data ID and a blob)



  • Stage 2 output File (Data ID and a blob)



  • Category 1 results (Data ID and some booleans)



  • Category 2 results (Data ID and some integers)



  • Category 3 results (Data ID and some integers)



Each table has ~200,000 rows.

I also have a view that basically glues all of these together so that I can SELECT a bunch of IDs (usually selecting them based on the Session ID) and view all the relevant data on one page.

The view works, and the query plan's index utilization seems sane, but the results are less-than-fast:

```
> EXPLAIN ANALYZE SELECT(*) FROM overlay WHERE test_session=12345;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Right Join (cost=7.19..74179.49 rows=10 width=305) (actual time=10680.129..10680.494 rows=4 loops=1)
Merge Cond: (p.data_id = d.id)
-> Merge Join (cost=7.19..75077.04 rows=183718 width=234) (actual time=0.192..10434.995 rows=173986 loops=1)
Merge Cond: (p.data_id = input_file.data_id)
-> Merge Join (cost=7.19..69917.74 rows=183718 width=222) (actual time=0.173..9255.653 rows=173986 loops=1)
Merge Cond: (p.data_id = stage1_output_file.data_id)
-> Merge Join (cost=5.50..62948.54 rows=183718 width=186) (actual time=0.153..8081.949 rows=173986 loops=1)
Merge Cond: (p.data_id = stage2_output_file.data_id)
-> Merge Join (cost=3.90..55217.36 rows=183723 width=150) (actual time=0.132..6918.814 rows=173986 loops=1)
Merge Cond: (p.data_id = stage3_output_file.data_id)

Solution

I am speculating here, but I guess the fact that you LEFT JOIN to the view makes the planner compute the results from the view as a whole before joining to the first part of the query.

Try inlining the query from the view and make it a JOIN instead of a LEFT JOIN, just to see if the planner finds a faster way now:

SELECT d.data_id, d.test_session, d.a, d.b, t.c, d.d, d.e, d.f
     , p.data_id AS p_data_id, p.x2, c.str AS impression, i.h
     , p.x3, p.x3, p.x4
     , s.x5, s.x6, s.x7, s.x8, s.x9, s.x10, s.x11, s.x12, s.x13, s.x14
     , t.x15, t.x16, t.x17, t.x18, t.x19, t.x20, t.x21, t.x22, t.x23
     , i.data AS input
     , s1.data AS stage1, s2.data AS stage2, s3.data AS stage3
FROM   input_data d
JOIN   category1_results        p ON p.data_id = d.data_id
JOIN   input_file               i USING (data_id)
JOIN   stage1_output_file      s1 USING (data_id)
JOIN   stage2_output_file      s2 USING (data_id)
JOIN   stage3_output_file      s3 USING (data_id)
JOIN   category2_results        s USING (data_id)
JOIN   category3_results        t USING (data_id)
JOIN   category1_impression_str c ON p.impression = c.id 
LEFT   JOIN quality_codes       t ON t.id = d.input_quality
WHERE  NOT d.deleted;


I cleaned up your syntax to make it more manageable. Added an alias for the second data_id column, so it can execute.

If that should lead to a considerably faster execution time, you can try and add missing rows due to the INNER JOIN like this:

SELECT DISTINCT ON (1,2,3,4,5,6,7,8) *
FROM (
    >
    ) x
UNION ALL
SELECT d.data_id, d.test_session, d.a, d.b, t.c, d.d, d.e, d.f
      ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
      ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
      ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
FROM   input_data d
LEFT   JOIN quality_codes t ON t.id = d.input_quality
WHERE  NOT d.deleted;
ORDER  BY 1,2,3,4,5,6,7,8, 9 NULLS LAST; -- p.data_id is otherwise not null

Code Snippets

SELECT d.data_id, d.test_session, d.a, d.b, t.c, d.d, d.e, d.f
     , p.data_id AS p_data_id, p.x2, c.str AS impression, i.h
     , p.x3, p.x3, p.x4
     , s.x5, s.x6, s.x7, s.x8, s.x9, s.x10, s.x11, s.x12, s.x13, s.x14
     , t.x15, t.x16, t.x17, t.x18, t.x19, t.x20, t.x21, t.x22, t.x23
     , i.data AS input
     , s1.data AS stage1, s2.data AS stage2, s3.data AS stage3
FROM   input_data d
JOIN   category1_results        p ON p.data_id = d.data_id
JOIN   input_file               i USING (data_id)
JOIN   stage1_output_file      s1 USING (data_id)
JOIN   stage2_output_file      s2 USING (data_id)
JOIN   stage3_output_file      s3 USING (data_id)
JOIN   category2_results        s USING (data_id)
JOIN   category3_results        t USING (data_id)
JOIN   category1_impression_str c ON p.impression = c.id 
LEFT   JOIN quality_codes       t ON t.id = d.input_quality
WHERE  NOT d.deleted;
SELECT DISTINCT ON (1,2,3,4,5,6,7,8) *
FROM (
    <<query>>
    ) x
UNION ALL
SELECT d.data_id, d.test_session, d.a, d.b, t.c, d.d, d.e, d.f
      ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
      ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
      ,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
FROM   input_data d
LEFT   JOIN quality_codes t ON t.id = d.input_quality
WHERE  NOT d.deleted;
ORDER  BY 1,2,3,4,5,6,7,8, 9 NULLS LAST; -- p.data_id is otherwise not null

Context

StackExchange Database Administrators Q#17126, answer score: 5

Revisions (0)

No revisions yet.