HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

PostgreSQL multiple processes and queries or nested query

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
postgresqlqueryandnestedmultipleprocessesqueries

Problem

I have a series of tables and queries I run a process against and at times this process can take hours - if not multiple days - depending on the date range I select. I am running PostgreSQL 9.4.

In reviewing the below queries, does it make sense to

  • Have a temp table and/or



  • Should I have a nested query?



  • Is there a better design to improve the structure?



I realize there is some back-end tuning I can optimize, but in just reviewing the below queries is there a glaring improvement I am missing or should be addressing better?

The first query:

  • I first insert the below results into a temp table



  • Running the below query can take hours



  • Then the insert can sometimes take an hour or more depending on the amount of data



  • The average amount of rows the can be insert are around two million or more.



  • There is no way to avoid this given the data set and requested results



```
INSERT INTO tempresults(
lmp_date, approved_lmp_name_a, approved_lmp_name_b,
path, approved_a_pnode,
approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source,
approved_b_source,
lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl,
submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled,
profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a,
lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)

SELECT
"Selected_Period".lmp_date as "Date",
"Selected_Period".lmp_name_a as "LMP Name A",
"Selected_Period".lmp_name_b as "LMP Name B",
CONCAT("Selected_Period".lmp_name_a,' - ',"Selected_Period".lmp_name_b) as "Path",
"Selected_Period".approved_a_pnode, "Selected_Period".approved_b_pnode,
"Selected_Period".approved_a_sink, "Selected_Period".approved_b_sink, "Selected_Period".approved_a_source,
"Selected_Period".approved_b_source,
AVG("Selected_Period".lmp_da_a) as "LMP DA A Avg",
AVG("Selected_Period".lmp

Solution

Style

Your style of writing SQL makes it very difficult to read... for example:

INSERT INTO tempresults(
            lmp_date, approved_lmp_name_a, approved_lmp_name_b, 
            path, approved_a_pnode, 
            approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source, 
            approved_b_source, 
            lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl, 
            submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled, 
            profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a, 
            lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)


Would read much more easily as:

INSERT INTO tempresults (

            lmp_date, 
            approved_lmp_name_a, 
            approved_lmp_name_b, 
            path, 
            approved_a_pnode, 
            approved_b_pnode, 
            approved_a_sink, 
            approved_b_sink, 
            approved_a_source, 
            approved_b_source, 
            lmp_da_a, 
            lmp_da_b, 
            da_spread_diff, 
            lmp_da_spread_pl, 
            submitted_bid, 
            filled, 
            filled_percentage,  
            priced_filled_at,  
            filled_vs_settled, 
            profit_flag,  
            filled_profit,  
            loss_flag,  
            filled_loss,  
            lmp_rt_a, 
            lmp_rt_b,  
            rt_spread_diff,  
            dart_a,  
            dart_b,  
            dart_spread_diff,  
            win_trigger,  
            loss_trigger
)


I realize this is a matter of style and doesn't affect code performance, but it's easier to maintain that way. Another example:

CASE WHEN ("Selected_Period".da_spread_diff*-1)0.50 THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) ELSE 0 END END AS "Filled Profit",


This is a bit of a nightmare to scroll horizontally to understand what is going on... How about something like this:

CASE 
  WHEN ("Selected_Period".da_spread_diff*-1)0.50 
    THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) 
    ELSE 0 
    END 
  END AS "Filled Profit",


Altogether I think it could use some documentation along with it, so the next person who has to maintain this does not go too crazy if it needs refactored. You could also do the same with your GROUP BY statement.

Stored functions

This part of the insert script caught my attention:

WHERE 
"Selected_Period".lmp_date >= 'XYXYXY' AND  -- starting date. user normally inputs this value
"Selected_Period".lmp_date <= 'YZYZYZ' AND -- ending date. user normally inputs this value
"Selected_Period".he = '5'  --this is hour. user normally inputs this value


So we have user-input values. Great! This would be a good candidate for a stored function!

Here is an example:

CREATE FUNCTION insert_into_your_table (
    start_date DATE,
    end_date DATE,
    selected_hours INT -- or perhaps TEXT in your case
) AS $ 
BEGIN

/* 
 * massive INSERT INTO .. SELECT statement here
 */

WHERE "Selected_Period".lmp_date >= start_date
  AND  "Selected_Period".lmp_date <= end_date
  AND  "Selected_Period".he = selected_hours
/* GROUP BY stuff... */
END;
$ LANGUAGE plpgsql;


Notice how the use of variables as input parameters also eliminates the need to explain what they are for. Then once it needs to be ran, just do:

SELECT insert_into_your_table (
    start_date := '2015-01-01',
    end_date := '2015-03-02',
    selected_hours := '5'
);


This will store the execution plan along with your query, making it faster to execute after you have called it once. You could do the same with your big SELECT query, without input parameters this time.

Calculations

I feel that your insert query has a rather... unhealthy amount of AVG calculations. AVG is expensive because it does both a SUM and COUNT together. The whole logic of your calculations seems terribly convoluted. It's hard to say how to improve it without having access to your data source, but I think it would be worth your while to take a piece of paper and try to draw out everything your query is doing (and then you'll have an idea just how convoluted it actually is).

A few things pop out immediately...

0.50 AS "Submitted Bid",


This 0.50 value is constantly used in your insert query. I'd suggest you declare it at the top and use it throughout...

submitted_bid DECIMAL(3,2) := 0.50;


Then, you write it more like:

submitted_bid AS "Submitted Bid",
CASE 
  WHEN ("Selected_Period".da_spread_diff*-1) <= submitted_bid 
  THEN TRUE 
  ELSE FALSE 
END AS "Filled",
-- etc.


This one comes up constantly in your insert query:

("Selected_Period".da_spread_diff*-1)


Put that one in a variable too... (sorry, data type is unclear, use whatever is appropriate)

```
period_

Code Snippets

INSERT INTO tempresults(
            lmp_date, approved_lmp_name_a, approved_lmp_name_b, 
            path, approved_a_pnode, 
            approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source, 
            approved_b_source, 
            lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl, 
            submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled, 
            profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a, 
            lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)
INSERT INTO tempresults (

            lmp_date, 
            approved_lmp_name_a, 
            approved_lmp_name_b, 
            path, 
            approved_a_pnode, 
            approved_b_pnode, 
            approved_a_sink, 
            approved_b_sink, 
            approved_a_source, 
            approved_b_source, 
            lmp_da_a, 
            lmp_da_b, 
            da_spread_diff, 
            lmp_da_spread_pl, 
            submitted_bid, 
            filled, 
            filled_percentage,  
            priced_filled_at,  
            filled_vs_settled, 
            profit_flag,  
            filled_profit,  
            loss_flag,  
            filled_loss,  
            lmp_rt_a, 
            lmp_rt_b,  
            rt_spread_diff,  
            dart_a,  
            dart_b,  
            dart_spread_diff,  
            win_trigger,  
            loss_trigger
)
CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN  CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) ELSE 0 END END AS "Filled Profit",
CASE 
  WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 
  THEN  
    CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 
    THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) 
    ELSE 0 
    END 
  END AS "Filled Profit",
WHERE 
"Selected_Period".lmp_date >= 'XYXYXY' AND  -- starting date. user normally inputs this value
"Selected_Period".lmp_date <= 'YZYZYZ' AND -- ending date. user normally inputs this value
"Selected_Period".he = '5'  --this is hour. user normally inputs this value

Context

StackExchange Code Review Q#83044, answer score: 6

Revisions (0)

No revisions yet.