patternsqlMinor
Normalizing Invoices
Viewed 0 times
normalizinginvoicesstackoverflow
Problem
I need to move data from one database to another, and since I don't have SSIS I'm doing this ETL with T-SQL scripts.
One of the source tables contains invoice details, and features a column that contains the number of units invoiced per size (it's clothing); as I transfer the data over to my database, I'm normalizing this information. The query works fine, the execution plan looks exactly as I expected, doesn't recommend adding any indexes, ...but it takes about 10 minutes to process.
The source table
The query is scheduled to run daily, as part of an overnight process - the 10 minutes don't really matter, but still I want to be sure everything is as efficient as it can be.
```
with cteSizedInvoices (
InvoiceNumber, InvoiceLine, SizeRangeCode, UnitsPerSize
)
as (select
src.f2,
cast(src.f5 as int),
src.f23,
src.f19
from Staging.dbo.[SourceTable] src
where src.f1 = '01'
)
--insert into [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
select
detail.Id InvoiceDetailId,
sz.Id SizeId,
buckets.Units,
getdate()
from
cteSizedInvoices src
inner join [DestinationDatabase].dbo.InvoiceHeaders header on src.InvoiceNumber = header.Number
inner join [DestinationDatabase].dbo.InvoiceDetails detail on src.InvoiceLine = detail.LineNumber
and detail.InvoiceHeaderId = header.Id
inner join [DestinationDatabase].dbo.SizeRanges ranges on src.SizeRangeCode = ranges.Code
cross apply (select id SizeIndex, cast(item as int) Units from [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)) buckets
inner join [DestinationDatabase].dbo.Sizes sz on buckets.SizeIndex = sz.SizeRangeIndex
One of the source tables contains invoice details, and features a column that contains the number of units invoiced per size (it's clothing); as I transfer the data over to my database, I'm normalizing this information. The query works fine, the execution plan looks exactly as I expected, doesn't recommend adding any indexes, ...but it takes about 10 minutes to process.
The source table
Staging.dbo.[SourceTable] contains about 780K rows; the query inserts 1.5M rows into the destination table [DestinationDatabase].dbo.InvoiceDetailSizes.The query is scheduled to run daily, as part of an overnight process - the 10 minutes don't really matter, but still I want to be sure everything is as efficient as it can be.
```
with cteSizedInvoices (
InvoiceNumber, InvoiceLine, SizeRangeCode, UnitsPerSize
)
as (select
src.f2,
cast(src.f5 as int),
src.f23,
src.f19
from Staging.dbo.[SourceTable] src
where src.f1 = '01'
)
--insert into [DestinationDatabase].dbo.InvoiceDetailSizes (InvoiceDetailId, SizeId, Units, DateInserted)
select
detail.Id InvoiceDetailId,
sz.Id SizeId,
buckets.Units,
getdate()
from
cteSizedInvoices src
inner join [DestinationDatabase].dbo.InvoiceHeaders header on src.InvoiceNumber = header.Number
inner join [DestinationDatabase].dbo.InvoiceDetails detail on src.InvoiceLine = detail.LineNumber
and detail.InvoiceHeaderId = header.Id
inner join [DestinationDatabase].dbo.SizeRanges ranges on src.SizeRangeCode = ranges.Code
cross apply (select id SizeIndex, cast(item as int) Units from [DestinationDatabase].dbo.BucketString(src.UnitsPerSize, 5, 1, 1)) buckets
inner join [DestinationDatabase].dbo.Sizes sz on buckets.SizeIndex = sz.SizeRangeIndex
Solution
You use the value of
could be declared from the beginning
After you do that it is a toss up whether or not to Declare another Variable for the Addition of the
Also please use some spaces in your equations, you use plenty of white space everywhere else.
I have a minute or two so I will also say this...
since this is going over so many records and that simple Variable freed up a good 10 seconds, I think that it might be worth it to move as much static calculation out of the query as possible, I know this is less about what SQL was meant for, but I also think this is more about being Dynamic as well.
Maybe I am going a little too far here, but this should still do the same thing as the original
I apologize for the mismatched casing, it's habit to capitalized those words...
When
I took out all the arithmetic that wasn't reliant on information from the query and made them their own variable, this isn't going to be a huge difference, but I think it will be faster than doing the arithmetic inside the query itself, it's like being distracted by that red ball all the time, it distracts you momentarily but you get the job done.
(@bucketSize + @bufferSize) inside of a where clause and to calculate a value in a select statement, I think if you took that and created it's own variable, that it might give you some performance increase. as well as some of the other arithmetic that could be done from the start and not calculated during the actual querylen(@values)/(@bucketSize+@bufferSize)+1could be declared from the beginning
DECLARE maxID
SET maxID = LEN(@values) / (@bucketSize + @bufferSize) + 1After you do that it is a toss up whether or not to Declare another Variable for the Addition of the
@bucketSize + @bufferSize, but it might be worth a try.Also please use some spaces in your equations, you use plenty of white space everywhere else.
I have a minute or two so I will also say this...
since this is going over so many records and that simple Variable freed up a good 10 seconds, I think that it might be worth it to move as much static calculation out of the query as possible, I know this is less about what SQL was meant for, but I also think this is more about being Dynamic as well.
Maybe I am going a little too far here, but this should still do the same thing as the original
create function [dbo].[BucketString] (
@values varchar(max),
@bucketSize int,
@bufferSize int = 1,
@offset int = 0)
returns @result table (id int, item varchar(max))
begin
DECLARE @bucketAndBufferSize
DECLARE @maxID
DECLARE @bufferMinusBufferMinusOne
DECLARE @bufferMinusOne
SET @bucketAndBufferSize = @bucketSize+@bufferSize
SET @maxID = LEN(@values) / (@bucketAndBufferSize) + 1
SET @bufferMinusBufferMinusOne = @bufferSize - (@bufferSize - 1)
SET @bufferMinusOne = @bufferSize - 1
with buckets as
(
select 1 id
union all
select t.id + 1
from buckets t
where id = t.id
and t.id < @maxID
)
insert into @result
select
id,
substring(@values, @offset + ((id - 1) * @bucketAndBufferSize + (case when @bufferMinusOne = 0 then 1 else @bufferMinusBufferMinusOne end)), @bucketSize) string
from buckets
option (maxrecursion 0)
return;
endI apologize for the mismatched casing, it's habit to capitalized those words...
When
bufferSize is unchanged you want it to be 1 so let's just do that and take out the extra call to the variable. I took out all the arithmetic that wasn't reliant on information from the query and made them their own variable, this isn't going to be a huge difference, but I think it will be faster than doing the arithmetic inside the query itself, it's like being distracted by that red ball all the time, it distracts you momentarily but you get the job done.
Code Snippets
len(@values)/(@bucketSize+@bufferSize)+1DECLARE maxID
SET maxID = LEN(@values) / (@bucketSize + @bufferSize) + 1create function [dbo].[BucketString] (
@values varchar(max),
@bucketSize int,
@bufferSize int = 1,
@offset int = 0)
returns @result table (id int, item varchar(max))
begin
DECLARE @bucketAndBufferSize
DECLARE @maxID
DECLARE @bufferMinusBufferMinusOne
DECLARE @bufferMinusOne
SET @bucketAndBufferSize = @bucketSize+@bufferSize
SET @maxID = LEN(@values) / (@bucketAndBufferSize) + 1
SET @bufferMinusBufferMinusOne = @bufferSize - (@bufferSize - 1)
SET @bufferMinusOne = @bufferSize - 1
with buckets as
(
select 1 id
union all
select t.id + 1
from buckets t
where id = t.id
and t.id < @maxID
)
insert into @result
select
id,
substring(@values, @offset + ((id - 1) * @bucketAndBufferSize + (case when @bufferMinusOne = 0 then 1 else @bufferMinusBufferMinusOne end)), @bucketSize) string
from buckets
option (maxrecursion 0)
return;
endContext
StackExchange Code Review Q#64165, answer score: 5
Revisions (0)
No revisions yet.