HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Running functions in parallel

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
functionsrunningparallel

Problem

This is for SQL Server 2012.

We have some import processes for FTP files that are picked up and read into a staging table, from there we massage/check the data before moving into production. One of the areas that are causing some issues is dates, some are valid, some are typos, some are just plain gibberish.

I have the following example table(s):

Create Table RawData
(
 InsertID int not null,
 MangledDateTime1 varchar(10) null,
 MangledDateTime2 varchar(10) null,
 MangledDateTime3 varchar(10) null
)


I also have a destination table(say in production)

Create Table FinalData
(
  PrimaryKeyID int not null, -- PK constraint here, ident
  ForeighKeyID int not null, -- points to InsertID of RawData
  ValidDateTime1 SmallDateTime null,
  ValidDateTime2 SmallDateTime null,
  ValidDateTime3 SmallDateTime null
)


I insert the following into the RawData table:

Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
 Values(1, '20001010', '20800630', '00000000') -- First is legit, second two are not
 Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
 Values(1, '20800630', '20130630', '20000000') -- middle is legit, first/third are not
 Insert Into RawData(InsertID, MangledDateTime1, MangledDateTime2, MangledDateTime3)
 Values(1, '00001010', '00800630', '20130630') -- Last is legit, first two are not


I wrote a function dbo.CreateDate to address the issue(s). We try to clean the data as best we can (use NULL if we can't), then convert the data to the correct datatype (in this case smalldatetime).

Insert Into FinalData(ForeighKeyID , ValidDateTime1, ValidDateTime2, ValidDateTime3)
Select 
 InsertID
 ,dbo.CreateDate(MangledDateTime1)
 ,dbo.CreateDate(MangledDateTime2)
 ,dbo.CreateDate(MangledDateTime3)
From RawData


We're running into some performance problems with functions. I'm wondering if/how they work in parallel.

I'm assuming here that the function CreateDate is bei

Solution

Using T-SQL scalar functions will frequently lead to performance problems* because SQL Server makes a separate function call (using a whole new T-SQL context) for each row. In addition, parallel execution is disallowed for the whole query.

T-SQL scalar functions can also make it difficult to troubleshoot performance problems (whether those problems are caused by the function or not). The function appears as a 'black box' to the query optimizer: it is assigned a fixed low estimated cost, regardless of the actual content of the function.

See this and this for more on the pitfalls of scalar functions.

You will probably be better off using the new TRY_CONVERT function in SQL Server 2012:

SELECT
    InsertID,
    dt1 = TRY_CONVERT(smalldatetime, MangledDateTime1),
    dt2 = TRY_CONVERT(smalldatetime, MangledDateTime2),
    dt3 = TRY_CONVERT(smalldatetime, MangledDateTime3)
FROM dbo.RawData;

╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║         dt1         ║         dt2         ║         dt3         ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ NULL                ║ NULL                ║
║        1 ║ NULL                ║ 2013-06-30 00:00:00 ║ NULL                ║
║        1 ║ NULL                ║ NULL                ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝


After the edit to the question

I see the function contains some specific logic. You could still look to use TRY_CONVERT as part of that, but you should definitely convert the scalar function to an in-line function. In-line functions (RETURNS TABLE) use a single SELECT statement and are expanded into the calling query and fully optimized in much the same way views are. It can be helpful to think of in-line functions as parameterized views.

For example, an approximate translation of the scalar function to an in-line version is:

CREATE FUNCTION dbo.CleanDate
    (@UnformattedString  varchar(12))
RETURNS TABLE
AS RETURN
SELECT Result =
    -- Successful conversion or NULL after
    -- workarounds applied in CROSS APPLY
    -- clauses below
    TRY_CONVERT(smalldatetime, ca3.string)
FROM
(
    -- Logic starts here
    SELECT        
        CASE
            WHEN @UnformattedString IS NULL
                THEN NULL
            WHEN LEN(@UnformattedString)  6 THEN '20' + RIGHT(ca1.string, 6)
        ELSE ca1.string
    END
) AS ca2 (string)
CROSS APPLY
(
    -- Next stage using result so far
    SELECT
        CASE 
            WHEN TRY_CONVERT(integer, LEFT(ca2.string, 4)) > YEAR(GETDATE())
                THEN '20790606'
            WHEN YEAR(GETDATE()) - TRY_CONVERT(integer, LEFT(ca2.string, 4)) >= 100
                THEN '20790606'
            ELSE ca2.string
        END
) AS ca3 (string);


The function used on the sample data:

SELECT
    InsertID,
    Result1 = CD1.Result,
    Result2 = CD2.Result,
    Result3 = CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;


Output:

╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║       Result1       ║       Result2       ║       Result3       ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ NULL                ║
║        1 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║ 2079-06-06 00:00:00 ║
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝


*CLR scalar functions have a much faster invocation path than T-SQL scalar functions and do not prevent parallelism.

Code Snippets

SELECT
    InsertID,
    dt1 = TRY_CONVERT(smalldatetime, MangledDateTime1),
    dt2 = TRY_CONVERT(smalldatetime, MangledDateTime2),
    dt3 = TRY_CONVERT(smalldatetime, MangledDateTime3)
FROM dbo.RawData;

╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║         dt1         ║         dt2         ║         dt3         ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ NULL                ║ NULL                ║
║        1 ║ NULL                ║ 2013-06-30 00:00:00 ║ NULL                ║
║        1 ║ NULL                ║ NULL                ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝
CREATE FUNCTION dbo.CleanDate
    (@UnformattedString  varchar(12))
RETURNS TABLE
AS RETURN
SELECT Result =
    -- Successful conversion or NULL after
    -- workarounds applied in CROSS APPLY
    -- clauses below
    TRY_CONVERT(smalldatetime, ca3.string)
FROM
(
    -- Logic starts here
    SELECT        
        CASE
            WHEN @UnformattedString IS NULL
                THEN NULL
            WHEN LEN(@UnformattedString) <= 1
                THEN NULL
            WHEN LEN(@UnformattedString) = 12
                THEN LEFT(@UnformattedString, 8)
            ELSE @UnformattedString
        END
) AS Input (string)
CROSS APPLY
(
    -- Next stage using result so far
    SELECT 
        CASE 
            WHEN @UnformattedString = '20000000' 
            THEN '20790606' 
            ELSE Input.string
        END
) AS ca1 (string)
CROSS APPLY 
(
    -- Next stage using result so far
    SELECT CASE
        WHEN LEFT(ca1.string, 2) = '00' THEN '20' + RIGHT(ca1.string, 6)
        WHEN LEFT(ca1.string, 2) = '18' THEN '19' + RIGHT(ca1.string, 6)
        WHEN LEFT(ca1.string, 2) = '19' THEN ca1.string
        WHEN LEFT(ca1.string, 2) = '20' THEN ca1.string
        WHEN LEN(ca1.string) <> 6 THEN '20' + RIGHT(ca1.string, 6)
        ELSE ca1.string
    END
) AS ca2 (string)
CROSS APPLY
(
    -- Next stage using result so far
    SELECT
        CASE 
            WHEN TRY_CONVERT(integer, LEFT(ca2.string, 4)) > YEAR(GETDATE())
                THEN '20790606'
            WHEN YEAR(GETDATE()) - TRY_CONVERT(integer, LEFT(ca2.string, 4)) >= 100
                THEN '20790606'
            ELSE ca2.string
        END
) AS ca3 (string);
SELECT
    InsertID,
    Result1 = CD1.Result,
    Result2 = CD2.Result,
    Result3 = CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║       Result1       ║       Result2       ║       Result3       ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ NULL                ║
║        1 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║ 2079-06-06 00:00:00 ║
║        1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝

Context

StackExchange Database Administrators Q#44997, answer score: 11

Revisions (0)

No revisions yet.