snippetsqlMinor
How to implement a Set based algorithm/UDF
Viewed 0 times
implementalgorithmbasedhowsetudf
Problem
I have an algorithm that I need to run against every row in a table with 800K rows and 38 columns.
The algorithm is implemented in VBA and does a bunch of math using values from some columns to manipulate other columns.
I'm currently using Excel (ADO) to query SQL and use VBA with client side cursors to apply the algorithm by loop through every row. It works but takes 7 hours to run.
The VBA code is complex enough that it would be a lot of work to recode it into T-SQL.
I've read about CLR integration and UDFs as possible routes. I also thought about putting the VBA code in an SSIS script task to get closer to the database but am sure an expert methodology to this type of performance problem exists.
Ideally I'd be able to run the algorithm against as many rows (all?) as possible in a parallel set based way.
Any help greatly predicated on how to get best performance with this type of problem.
--Edit
Thanks for the comments, I'm using MS SQL 2014 Enterprise, here's some more details:
The algorithm finds characteristic patterns in time series data. The functions within the algorithm perform polynomial smoothing, windowing, and finds regions of interest based on input criteria, returning a dozen values and some Boolean results.
My question is more about methodology than the actual algorithm: If I want to achieve parallel computation on many rows at once, what are my options.
I see re-code into T-SQL is recommended which is a lot of work but possible, however the algorithm developer works in VBA and it changes frequently so I'd need keep in sync with the T-SQL version and re-validate every change.
Is T-SQL the only way to implement set based functions?
The algorithm is implemented in VBA and does a bunch of math using values from some columns to manipulate other columns.
I'm currently using Excel (ADO) to query SQL and use VBA with client side cursors to apply the algorithm by loop through every row. It works but takes 7 hours to run.
The VBA code is complex enough that it would be a lot of work to recode it into T-SQL.
I've read about CLR integration and UDFs as possible routes. I also thought about putting the VBA code in an SSIS script task to get closer to the database but am sure an expert methodology to this type of performance problem exists.
Ideally I'd be able to run the algorithm against as many rows (all?) as possible in a parallel set based way.
Any help greatly predicated on how to get best performance with this type of problem.
--Edit
Thanks for the comments, I'm using MS SQL 2014 Enterprise, here's some more details:
The algorithm finds characteristic patterns in time series data. The functions within the algorithm perform polynomial smoothing, windowing, and finds regions of interest based on input criteria, returning a dozen values and some Boolean results.
My question is more about methodology than the actual algorithm: If I want to achieve parallel computation on many rows at once, what are my options.
I see re-code into T-SQL is recommended which is a lot of work but possible, however the algorithm developer works in VBA and it changes frequently so I'd need keep in sync with the T-SQL version and re-validate every change.
Is T-SQL the only way to implement set based functions?
Solution
With regards to methodology, I believe you are barking up the wrong b-tree ;-).
What we know:
First, let's consolidate and review what we know about the situation:
-
There is a Stored Procedure that is called for every row:
-
The definition (at least in part) is:
What we can surmise:
Next, we can look at all of these data points together to see if we can synthesize additional details that will help us find one or more bottle necks, and either point towards a solution, or at least rule some possible solutions out.
The current direction of thought in the comments is that the major issue is data transfer between SQL Server and Excel. Is that really the case? If the Stored Procedure is called for each of the 800,000 rows and takes 50 ms per each call (i.e. per each row), that adds up to 40,000 seconds (not ms). And that is equivalent to 666 minutes (hhmm ;-), or just over 11 hours. Yet the whole process was said to take only 7 hours to run. We are already 4 hours over the total time, and we have even added in time to do the calculations or save the results back to SQL Server. So something is not right here.
Looking at the definition of the Stored Procedure, there is only an input parameter for
I think that focusing on this "filter" Stored Procedure, or any data transfer from SQL Server to Excel, is a red herring.
For the moment, I think the most relevant indicators of lackluster performance are:
I suspect that:
My recommendation (based on currently available information):
-
Your biggest area of improvement would be to update multiple rows at one time (i.e. in one transaction). You should update your process to work in terms of each
-
If your clustered index isn't already defined as
-
You should consider moving the logic to a compiled language. I would suggest creating a .NET WinForms app or even Console App. I prefer Console App as it is easy to schedule via SQL Agent or Windows Scheduled Tasks. It shouldn't matter whether it is done in VB.NET or C#. VB.NET might be a more natural fit for your developer, but there will still be some learning curve.
I don't see any reason at this point to move to SQLCLR. If the a
What we know:
First, let's consolidate and review what we know about the situation:
- Somewhat complex calculations need to be performed:
- This needs to happen on every row of this table.
- The algorithm changes frequently.
- The algorithm ... [uses] values from some columns to manipulate other columns
- Current processing time is: 7 hours
- The table:
- contains 800,000 rows.
- has 38 columns.
- The application back-end:
- is Microsoft Excel.
- uses VBA (Visual Basic for Applications)** as the language.
- Database is SQL Server 2014, Enterprise Edition.
-
There is a Stored Procedure that is called for every row:
- This takes 50 ms (on avg, I assume) to run.
- It returns approximately 4000 rows.
-
The definition (at least in part) is:
SELECT AVG([AD_Sensor_Data])
OVER (ORDER BY [RowID] ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING)
as 'AD_Sensor_Data'
FROM [AD_Points]
WHERE [FileID] = @FileID
ORDER BY [RowID] ASC
What we can surmise:
Next, we can look at all of these data points together to see if we can synthesize additional details that will help us find one or more bottle necks, and either point towards a solution, or at least rule some possible solutions out.
The current direction of thought in the comments is that the major issue is data transfer between SQL Server and Excel. Is that really the case? If the Stored Procedure is called for each of the 800,000 rows and takes 50 ms per each call (i.e. per each row), that adds up to 40,000 seconds (not ms). And that is equivalent to 666 minutes (hhmm ;-), or just over 11 hours. Yet the whole process was said to take only 7 hours to run. We are already 4 hours over the total time, and we have even added in time to do the calculations or save the results back to SQL Server. So something is not right here.
Looking at the definition of the Stored Procedure, there is only an input parameter for
@FileID; there isn't any filter on @RowID. So I suspect that one of the following two scenarios is happening:- This Stored Procedure does not actually get called per each row, but instead per each
@FileID, which appears to span approximately 4000 rows. If the stated 4000 rows returned is a fairly consistent amount, then there are only 200 of those grouping in the 800,000 rows. And 200 executions taking 50 ms each amounts to only 10 seconds out of that 7 hours.
- If this stored procedure actually does get called for every row, then wouldn't the first time a new
@FileIDis passed in take slightly longer to pull new rows into the Buffer Pool, but then the next 3999 executions would typically return faster due to already being cached, right?
I think that focusing on this "filter" Stored Procedure, or any data transfer from SQL Server to Excel, is a red herring.
For the moment, I think the most relevant indicators of lackluster performance are:
- There are 800,000 rows
- The operation works on one row at a time
- The data is being saved back to SQL Server, hence "[uses] values from some columns to manipulate other columns" [ my emphasis ;-) ]
I suspect that:
- while there is some room for improvement on the data retrieval and calculations, making those better wouldn't amount to a significant reduction in processing time.
- the major bottleneck is issuing 800,000 separate
UPDATEstatements, which is 800,000 separate transactions.
My recommendation (based on currently available information):
-
Your biggest area of improvement would be to update multiple rows at one time (i.e. in one transaction). You should update your process to work in terms of each
FileID instead of each RowID. So:- read in all 4000 rows of a particular
FileIDinto an array
- the array should contain elements representing the fields being manipulated
- cycle through the array, processing each row as you currently do
- once all rows in the array (i.e. for this particular
FileID) have been calculated:
- start a transaction
- call each update per each
RowID
- if no errors, commit the transaction
- if an error occurred, rollback and handle appropriately
-
If your clustered index isn't already defined as
(FileID, RowID) then you should consider that (as @MikaelEriksson suggested in a comment on the Question). It won't help these singleton UPDATEs, but it would at least slightly improve the aggregate operations, such as what you are doing in that "filter" stored procedure since they are all based on FileID.-
You should consider moving the logic to a compiled language. I would suggest creating a .NET WinForms app or even Console App. I prefer Console App as it is easy to schedule via SQL Agent or Windows Scheduled Tasks. It shouldn't matter whether it is done in VB.NET or C#. VB.NET might be a more natural fit for your developer, but there will still be some learning curve.
I don't see any reason at this point to move to SQLCLR. If the a
Context
StackExchange Database Administrators Q#116542, answer score: 8
Revisions (0)
No revisions yet.