patternsqlModerate
Running functions in parallel
Viewed 0 times
functionsrunningparallel
Problem
This is for SQL Server 2012.
We have some import processes for FTP files that are picked up and read into a staging table, from there we massage/check the data before moving into production. One of the areas that are causing some issues is dates, some are valid, some are typos, some are just plain gibberish.
I have the following example table(s):
I also have a destination table(say in production)
I insert the following into the RawData table:
I wrote a function
We're running into some performance problems with functions. I'm wondering if/how they work in parallel.
I'm assuming here that the function
We have some import processes for FTP files that are picked up and read into a staging table, from there we massage/check the data before moving into production. One of the areas that are causing some issues is dates, some are valid, some are typos, some are just plain gibberish.
I have the following example table(s):
Create Table RawData
(
InsertID int not null,
MangledDateTime1 varchar(10) null,
MangledDateTime2 varchar(10) null,
MangledDateTime3 varchar(10) null
)I also have a destination table(say in production)
Create Table FinalData
(
PrimaryKeyID int not null, -- PK constraint here, ident
ForeighKeyID int not null, -- points to InsertID of RawData
ValidDateTime1 SmallDateTime null,
ValidDateTime2 SmallDateTime null,
ValidDateTime3 SmallDateTime null
)I insert the following into the RawData table:
Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
Values(1, '20001010', '20800630', '00000000') -- First is legit, second two are not
Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
Values(1, '20800630', '20130630', '20000000') -- middle is legit, first/third are not
Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
Values(1, '00001010', '00800630', '20130630') -- Last is legit, first two are notI wrote a function
dbo.CreateDate to address the issue(s). We try to clean the data as best we can (use NULL if we can't), then convert the data to the correct datatype (in this case smalldatetime).Insert Into FinalData(ForeighKeyID , ValidDateTime1, ValidDateTime2, ValidDateTime3)
Select
InsertID
,dbo.CreateDate(MangledDateTime1)
,dbo.CreateDate(MangledDateTime2)
,dbo.CreateDate(MangledDateTime3)
From RawDataWe're running into some performance problems with functions. I'm wondering if/how they work in parallel.
I'm assuming here that the function
CreateDate is beiSolution
Using T-SQL scalar functions will frequently lead to performance problems* because SQL Server makes a separate function call (using a whole new T-SQL context) for each row. In addition, parallel execution is disallowed for the whole query.
T-SQL scalar functions can also make it difficult to troubleshoot performance problems (whether those problems are caused by the function or not). The function appears as a 'black box' to the query optimizer: it is assigned a fixed low estimated cost, regardless of the actual content of the function.
See this and this for more on the pitfalls of scalar functions.
You will probably be better off using the new TRY_CONVERT function in SQL Server 2012:
After the edit to the question
I see the function contains some specific logic. You could still look to use
For example, an approximate translation of the scalar function to an in-line version is:
The function used on the sample data:
Output:
*CLR scalar functions have a much faster invocation path than T-SQL scalar functions and do not prevent parallelism.
T-SQL scalar functions can also make it difficult to troubleshoot performance problems (whether those problems are caused by the function or not). The function appears as a 'black box' to the query optimizer: it is assigned a fixed low estimated cost, regardless of the actual content of the function.
See this and this for more on the pitfalls of scalar functions.
You will probably be better off using the new TRY_CONVERT function in SQL Server 2012:
SELECT
InsertID,
dt1 = TRY_CONVERT(smalldatetime, MangledDateTime1),
dt2 = TRY_CONVERT(smalldatetime, MangledDateTime2),
dt3 = TRY_CONVERT(smalldatetime, MangledDateTime3)
FROM dbo.RawData;
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║ dt1 ║ dt2 ║ dt3 ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║ 1 ║ 2000-10-10 00:00:00 ║ NULL ║ NULL ║
║ 1 ║ NULL ║ 2013-06-30 00:00:00 ║ NULL ║
║ 1 ║ NULL ║ NULL ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝After the edit to the question
I see the function contains some specific logic. You could still look to use
TRY_CONVERT as part of that, but you should definitely convert the scalar function to an in-line function. In-line functions (RETURNS TABLE) use a single SELECT statement and are expanded into the calling query and fully optimized in much the same way views are. It can be helpful to think of in-line functions as parameterized views.For example, an approximate translation of the scalar function to an in-line version is:
CREATE FUNCTION dbo.CleanDate
(@UnformattedString varchar(12))
RETURNS TABLE
AS RETURN
SELECT Result =
-- Successful conversion or NULL after
-- workarounds applied in CROSS APPLY
-- clauses below
TRY_CONVERT(smalldatetime, ca3.string)
FROM
(
-- Logic starts here
SELECT
CASE
WHEN @UnformattedString IS NULL
THEN NULL
WHEN LEN(@UnformattedString) 6 THEN '20' + RIGHT(ca1.string, 6)
ELSE ca1.string
END
) AS ca2 (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT
CASE
WHEN TRY_CONVERT(integer, LEFT(ca2.string, 4)) > YEAR(GETDATE())
THEN '20790606'
WHEN YEAR(GETDATE()) - TRY_CONVERT(integer, LEFT(ca2.string, 4)) >= 100
THEN '20790606'
ELSE ca2.string
END
) AS ca3 (string);The function used on the sample data:
SELECT
InsertID,
Result1 = CD1.Result,
Result2 = CD2.Result,
Result3 = CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;Output:
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║ Result1 ║ Result2 ║ Result3 ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║ 1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ NULL ║
║ 1 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║ 2079-06-06 00:00:00 ║
║ 1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝*CLR scalar functions have a much faster invocation path than T-SQL scalar functions and do not prevent parallelism.
Code Snippets
SELECT
InsertID,
dt1 = TRY_CONVERT(smalldatetime, MangledDateTime1),
dt2 = TRY_CONVERT(smalldatetime, MangledDateTime2),
dt3 = TRY_CONVERT(smalldatetime, MangledDateTime3)
FROM dbo.RawData;
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║ dt1 ║ dt2 ║ dt3 ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║ 1 ║ 2000-10-10 00:00:00 ║ NULL ║ NULL ║
║ 1 ║ NULL ║ 2013-06-30 00:00:00 ║ NULL ║
║ 1 ║ NULL ║ NULL ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝CREATE FUNCTION dbo.CleanDate
(@UnformattedString varchar(12))
RETURNS TABLE
AS RETURN
SELECT Result =
-- Successful conversion or NULL after
-- workarounds applied in CROSS APPLY
-- clauses below
TRY_CONVERT(smalldatetime, ca3.string)
FROM
(
-- Logic starts here
SELECT
CASE
WHEN @UnformattedString IS NULL
THEN NULL
WHEN LEN(@UnformattedString) <= 1
THEN NULL
WHEN LEN(@UnformattedString) = 12
THEN LEFT(@UnformattedString, 8)
ELSE @UnformattedString
END
) AS Input (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT
CASE
WHEN @UnformattedString = '20000000'
THEN '20790606'
ELSE Input.string
END
) AS ca1 (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT CASE
WHEN LEFT(ca1.string, 2) = '00' THEN '20' + RIGHT(ca1.string, 6)
WHEN LEFT(ca1.string, 2) = '18' THEN '19' + RIGHT(ca1.string, 6)
WHEN LEFT(ca1.string, 2) = '19' THEN ca1.string
WHEN LEFT(ca1.string, 2) = '20' THEN ca1.string
WHEN LEN(ca1.string) <> 6 THEN '20' + RIGHT(ca1.string, 6)
ELSE ca1.string
END
) AS ca2 (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT
CASE
WHEN TRY_CONVERT(integer, LEFT(ca2.string, 4)) > YEAR(GETDATE())
THEN '20790606'
WHEN YEAR(GETDATE()) - TRY_CONVERT(integer, LEFT(ca2.string, 4)) >= 100
THEN '20790606'
ELSE ca2.string
END
) AS ca3 (string);SELECT
InsertID,
Result1 = CD1.Result,
Result2 = CD2.Result,
Result3 = CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║ Result1 ║ Result2 ║ Result3 ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║ 1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ NULL ║
║ 1 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║ 2079-06-06 00:00:00 ║
║ 1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝Context
StackExchange Database Administrators Q#44997, answer score: 11
Revisions (0)
No revisions yet.