patternsqlModerate
Emulate user-defined scalar function in a way which doesn't prevent parallelism
Viewed 0 times
preventuserscalarfunctionwaydoesnwhichparallelismdefinedemulate
Problem
I'm trying to see if there's a way to trick SQL Server to use a certain plan for the query.
Imagine you have some data which is shared between different processes. So, suppose we have some experiment results which take a lot of space. Then, for each process we know which year/month of experiment result we want to use.
Now, for every process we have parameters saved in the table
Let's add some test data:
Now, it's very easy to get experiment results by
The plan is nice and parallel:
``
- Environment
Imagine you have some data which is shared between different processes. So, suppose we have some experiment results which take a lot of space. Then, for each process we know which year/month of experiment result we want to use.
if object_id('dbo.SharedData') is not null
drop table SharedData
create table dbo.SharedData (
experiment_year int,
experiment_month int,
rn int,
calculated_number int,
primary key (experiment_year, experiment_month, rn)
)
goNow, for every process we have parameters saved in the table
if object_id('dbo.Params') is not null
drop table dbo.Params
create table dbo.Params (
session_id int,
experiment_year int,
experiment_month int,
primary key (session_id)
)
go- Test data
Let's add some test data:
insert into dbo.Params (session_id, experiment_year, experiment_month)
select 1, 2014, 3 union all
select 2, 2014, 4
go
insert into dbo.SharedData (experiment_year, experiment_month, rn, calculated_number)
select
2014, 3, row_number() over(order by v1.name), abs(Checksum(newid())) % 10
from master.dbo.spt_values as v1
cross join master.dbo.spt_values as v2
go
insert into dbo.SharedData (experiment_year, experiment_month, rn, calculated_number)
select
2014, 4, row_number() over(order by v1.name), abs(Checksum(newid())) % 10
from master.dbo.spt_values as v1
cross join master.dbo.spt_values as v2
go- Fetching results
Now, it's very easy to get experiment results by
@experiment_year/@experiment_month:create or alter function dbo.f_GetSharedData(@experiment_year int, @experiment_month int)
returns table
as
return (
select
d.rn,
d.calculated_number
from dbo.SharedData as d
where
d.experiment_year = @experiment_year and
d.experiment_month = @experiment_month
)
goThe plan is nice and parallel:
``
Solution
You cannot really safely achieve exactly what you want in SQL Server today, i.e. in a single statement and with parallel execution, within the restrictions laid out in the question (as I perceive them).
So my simple answer is no. The rest of this answer is mostly a discussion of why that is, in case it is of interest.
It is possible to get a parallel plan, as noted in the question, but there are two main varieties, neither of which are suitable for your needs:
-
A correlated nested loops join, with a round-robin distribute streams on the top level. Given that a single row is guaranteed to come from
-
The other alternative is for independent parallelism on the inner side of the nested loops join. Independent here means that threads are started up on the inner side, and not merely the same thread(s) as are executing the outer side of the nested loops join. SQL Server only supports independent inner-side nested loops parallelism when there is guaranteed to be one outer-side row and there are no correlated join parameters (plan 2).
So, we have a choice of a parallel plan that is serial (due to one thread) with the desired correlated values; or an inner-side parallel plan that has to scan because it has no parameters to seek with. (Aside: It really ought to be allowed to drive inner-side parallelism using exactly one set of correlated parameters, but it has never been implemented, probably for good reason).
A natural question then is: why do we need correlated parameters at all? Why can SQL Server not simply seek directly to the scalar values provided by e.g. a subquery?
Well, SQL Server can only 'index seek' using simple scalar references, e.g. a constant, variable, column, or expression reference (so a scalar function result can also qualify). A subquery (or other similar construction) is simply too complex (and potentially unsafe) to push into the storage engine whole. So, separate query plan operators are required. This is turn requires correlation, which means no parallelism of the sort you want.
All in all, there really is no better solution currently than methods like assigning the lookup values to variables and then using those in the function parameters in a separate statement.
Now you may have specific local considerations that means caching the current values of the year and month in
But this falls into the category of workaround.
On the other hand, if aggregation performance is of primary importance, you could consider sticking with inline functions and creating a columnstore index (primary or secondary) on the table. You may find the benefits of columnstore storage, batch mode processing, and aggregate pushdown provide greater benefits than a row-mode parallel seek anyway.
But beware of scalar T-SQL functions, especially with columnstore storage, since it is easy to end up with the function being evaluated per-row in a separate row-mode Filter. It is generally quite tricky to guarantee the number of times SQL Server will choose to evaluate scalars, and better not to try.
So my simple answer is no. The rest of this answer is mostly a discussion of why that is, in case it is of interest.
It is possible to get a parallel plan, as noted in the question, but there are two main varieties, neither of which are suitable for your needs:
-
A correlated nested loops join, with a round-robin distribute streams on the top level. Given that a single row is guaranteed to come from
Params for a specific session_id value, the inner side will run on a single thread, even though it is marked with the parallelism icon. This is why the apparently-parallel plan 3 does not perform as well; it is in fact serial.-
The other alternative is for independent parallelism on the inner side of the nested loops join. Independent here means that threads are started up on the inner side, and not merely the same thread(s) as are executing the outer side of the nested loops join. SQL Server only supports independent inner-side nested loops parallelism when there is guaranteed to be one outer-side row and there are no correlated join parameters (plan 2).
So, we have a choice of a parallel plan that is serial (due to one thread) with the desired correlated values; or an inner-side parallel plan that has to scan because it has no parameters to seek with. (Aside: It really ought to be allowed to drive inner-side parallelism using exactly one set of correlated parameters, but it has never been implemented, probably for good reason).
A natural question then is: why do we need correlated parameters at all? Why can SQL Server not simply seek directly to the scalar values provided by e.g. a subquery?
Well, SQL Server can only 'index seek' using simple scalar references, e.g. a constant, variable, column, or expression reference (so a scalar function result can also qualify). A subquery (or other similar construction) is simply too complex (and potentially unsafe) to push into the storage engine whole. So, separate query plan operators are required. This is turn requires correlation, which means no parallelism of the sort you want.
All in all, there really is no better solution currently than methods like assigning the lookup values to variables and then using those in the function parameters in a separate statement.
Now you may have specific local considerations that means caching the current values of the year and month in
SESSION_CONTEXT is worthwhile i.e.:SELECT FGSD.calculated_number, COUNT_BIG(*)
FROM dbo.f_GetSharedData
(
CONVERT(integer, SESSION_CONTEXT(N'experiment_year')),
CONVERT(integer, SESSION_CONTEXT(N'experiment_month'))
) AS FGSD
GROUP BY FGSD.calculated_number;But this falls into the category of workaround.
On the other hand, if aggregation performance is of primary importance, you could consider sticking with inline functions and creating a columnstore index (primary or secondary) on the table. You may find the benefits of columnstore storage, batch mode processing, and aggregate pushdown provide greater benefits than a row-mode parallel seek anyway.
But beware of scalar T-SQL functions, especially with columnstore storage, since it is easy to end up with the function being evaluated per-row in a separate row-mode Filter. It is generally quite tricky to guarantee the number of times SQL Server will choose to evaluate scalars, and better not to try.
Code Snippets
SELECT FGSD.calculated_number, COUNT_BIG(*)
FROM dbo.f_GetSharedData
(
CONVERT(integer, SESSION_CONTEXT(N'experiment_year')),
CONVERT(integer, SESSION_CONTEXT(N'experiment_month'))
) AS FGSD
GROUP BY FGSD.calculated_number;Context
StackExchange Database Administrators Q#202211, answer score: 13
Revisions (0)
No revisions yet.