patternsqlMinor
PostgreSQL multiple processes and queries or nested query
Viewed 0 times
postgresqlqueryandnestedmultipleprocessesqueries
Problem
I have a series of tables and queries I run a process against and at times this process can take hours - if not multiple days - depending on the date range I select. I am running PostgreSQL 9.4.
In reviewing the below queries, does it make sense to
I realize there is some back-end tuning I can optimize, but in just reviewing the below queries is there a glaring improvement I am missing or should be addressing better?
The first query:
```
INSERT INTO tempresults(
lmp_date, approved_lmp_name_a, approved_lmp_name_b,
path, approved_a_pnode,
approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source,
approved_b_source,
lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl,
submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled,
profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a,
lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)
SELECT
"Selected_Period".lmp_date as "Date",
"Selected_Period".lmp_name_a as "LMP Name A",
"Selected_Period".lmp_name_b as "LMP Name B",
CONCAT("Selected_Period".lmp_name_a,' - ',"Selected_Period".lmp_name_b) as "Path",
"Selected_Period".approved_a_pnode, "Selected_Period".approved_b_pnode,
"Selected_Period".approved_a_sink, "Selected_Period".approved_b_sink, "Selected_Period".approved_a_source,
"Selected_Period".approved_b_source,
AVG("Selected_Period".lmp_da_a) as "LMP DA A Avg",
AVG("Selected_Period".lmp
In reviewing the below queries, does it make sense to
- Have a temp table and/or
- Should I have a nested query?
- Is there a better design to improve the structure?
I realize there is some back-end tuning I can optimize, but in just reviewing the below queries is there a glaring improvement I am missing or should be addressing better?
The first query:
- I first insert the below results into a temp table
- Running the below query can take hours
- Then the insert can sometimes take an hour or more depending on the amount of data
- The average amount of rows the can be insert are around two million or more.
- There is no way to avoid this given the data set and requested results
```
INSERT INTO tempresults(
lmp_date, approved_lmp_name_a, approved_lmp_name_b,
path, approved_a_pnode,
approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source,
approved_b_source,
lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl,
submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled,
profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a,
lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)
SELECT
"Selected_Period".lmp_date as "Date",
"Selected_Period".lmp_name_a as "LMP Name A",
"Selected_Period".lmp_name_b as "LMP Name B",
CONCAT("Selected_Period".lmp_name_a,' - ',"Selected_Period".lmp_name_b) as "Path",
"Selected_Period".approved_a_pnode, "Selected_Period".approved_b_pnode,
"Selected_Period".approved_a_sink, "Selected_Period".approved_b_sink, "Selected_Period".approved_a_source,
"Selected_Period".approved_b_source,
AVG("Selected_Period".lmp_da_a) as "LMP DA A Avg",
AVG("Selected_Period".lmp
Solution
Style
Your style of writing SQL makes it very difficult to read... for example:
Would read much more easily as:
I realize this is a matter of style and doesn't affect code performance, but it's easier to maintain that way. Another example:
This is a bit of a nightmare to scroll horizontally to understand what is going on... How about something like this:
Altogether I think it could use some documentation along with it, so the next person who has to maintain this does not go too crazy if it needs refactored. You could also do the same with your
Stored functions
This part of the insert script caught my attention:
So we have user-input values. Great! This would be a good candidate for a stored function!
Here is an example:
Notice how the use of variables as input parameters also eliminates the need to explain what they are for. Then once it needs to be ran, just do:
This will store the execution plan along with your query, making it faster to execute after you have called it once. You could do the same with your big
Calculations
I feel that your insert query has a rather... unhealthy amount of
A few things pop out immediately...
This
Then, you write it more like:
This one comes up constantly in your insert query:
Put that one in a variable too... (sorry, data type is unclear, use whatever is appropriate)
```
period_
Your style of writing SQL makes it very difficult to read... for example:
INSERT INTO tempresults(
lmp_date, approved_lmp_name_a, approved_lmp_name_b,
path, approved_a_pnode,
approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source,
approved_b_source,
lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl,
submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled,
profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a,
lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)Would read much more easily as:
INSERT INTO tempresults (
lmp_date,
approved_lmp_name_a,
approved_lmp_name_b,
path,
approved_a_pnode,
approved_b_pnode,
approved_a_sink,
approved_b_sink,
approved_a_source,
approved_b_source,
lmp_da_a,
lmp_da_b,
da_spread_diff,
lmp_da_spread_pl,
submitted_bid,
filled,
filled_percentage,
priced_filled_at,
filled_vs_settled,
profit_flag,
filled_profit,
loss_flag,
filled_loss,
lmp_rt_a,
lmp_rt_b,
rt_spread_diff,
dart_a,
dart_b,
dart_spread_diff,
win_trigger,
loss_trigger
)I realize this is a matter of style and doesn't affect code performance, but it's easier to maintain that way. Another example:
CASE WHEN ("Selected_Period".da_spread_diff*-1)0.50 THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) ELSE 0 END END AS "Filled Profit",This is a bit of a nightmare to scroll horizontally to understand what is going on... How about something like this:
CASE
WHEN ("Selected_Period".da_spread_diff*-1)0.50
THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff)
ELSE 0
END
END AS "Filled Profit",Altogether I think it could use some documentation along with it, so the next person who has to maintain this does not go too crazy if it needs refactored. You could also do the same with your
GROUP BY statement.Stored functions
This part of the insert script caught my attention:
WHERE
"Selected_Period".lmp_date >= 'XYXYXY' AND -- starting date. user normally inputs this value
"Selected_Period".lmp_date <= 'YZYZYZ' AND -- ending date. user normally inputs this value
"Selected_Period".he = '5' --this is hour. user normally inputs this valueSo we have user-input values. Great! This would be a good candidate for a stored function!
Here is an example:
CREATE FUNCTION insert_into_your_table (
start_date DATE,
end_date DATE,
selected_hours INT -- or perhaps TEXT in your case
) AS $
BEGIN
/*
* massive INSERT INTO .. SELECT statement here
*/
WHERE "Selected_Period".lmp_date >= start_date
AND "Selected_Period".lmp_date <= end_date
AND "Selected_Period".he = selected_hours
/* GROUP BY stuff... */
END;
$ LANGUAGE plpgsql;Notice how the use of variables as input parameters also eliminates the need to explain what they are for. Then once it needs to be ran, just do:
SELECT insert_into_your_table (
start_date := '2015-01-01',
end_date := '2015-03-02',
selected_hours := '5'
);This will store the execution plan along with your query, making it faster to execute after you have called it once. You could do the same with your big
SELECT query, without input parameters this time.Calculations
I feel that your insert query has a rather... unhealthy amount of
AVG calculations. AVG is expensive because it does both a SUM and COUNT together. The whole logic of your calculations seems terribly convoluted. It's hard to say how to improve it without having access to your data source, but I think it would be worth your while to take a piece of paper and try to draw out everything your query is doing (and then you'll have an idea just how convoluted it actually is).A few things pop out immediately...
0.50 AS "Submitted Bid",This
0.50 value is constantly used in your insert query. I'd suggest you declare it at the top and use it throughout...submitted_bid DECIMAL(3,2) := 0.50;Then, you write it more like:
submitted_bid AS "Submitted Bid",
CASE
WHEN ("Selected_Period".da_spread_diff*-1) <= submitted_bid
THEN TRUE
ELSE FALSE
END AS "Filled",
-- etc.This one comes up constantly in your insert query:
("Selected_Period".da_spread_diff*-1)Put that one in a variable too... (sorry, data type is unclear, use whatever is appropriate)
```
period_
Code Snippets
INSERT INTO tempresults(
lmp_date, approved_lmp_name_a, approved_lmp_name_b,
path, approved_a_pnode,
approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source,
approved_b_source,
lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl,
submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled,
profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a,
lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)INSERT INTO tempresults (
lmp_date,
approved_lmp_name_a,
approved_lmp_name_b,
path,
approved_a_pnode,
approved_b_pnode,
approved_a_sink,
approved_b_sink,
approved_a_source,
approved_b_source,
lmp_da_a,
lmp_da_b,
da_spread_diff,
lmp_da_spread_pl,
submitted_bid,
filled,
filled_percentage,
priced_filled_at,
filled_vs_settled,
profit_flag,
filled_profit,
loss_flag,
filled_loss,
lmp_rt_a,
lmp_rt_b,
rt_spread_diff,
dart_a,
dart_b,
dart_spread_diff,
win_trigger,
loss_trigger
)CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) ELSE 0 END END AS "Filled Profit",CASE
WHEN ("Selected_Period".da_spread_diff*-1)<=0.50
THEN
CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50
THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff)
ELSE 0
END
END AS "Filled Profit",WHERE
"Selected_Period".lmp_date >= 'XYXYXY' AND -- starting date. user normally inputs this value
"Selected_Period".lmp_date <= 'YZYZYZ' AND -- ending date. user normally inputs this value
"Selected_Period".he = '5' --this is hour. user normally inputs this valueContext
StackExchange Code Review Q#83044, answer score: 6
Revisions (0)
No revisions yet.