HiveBrain v1.2.0

How to insert/update millions of rows into SQL Server tables in an effective way?

Submitted by: @import:stackexchange-dba

Problem

Often in my job I have to create a procedure inside SQL Server that processes millions of data rows, saves them into a temp table (staging table), and finally saves them into one or more tables in one or more databases.

I am not looking for alternative solutions, like SSIS.

I cannot disable the indexes or constraints, take the DB offline, change the recovery mode, etc.

We have looked at scheduling this process to run when the systems are less busy, but we operate in a 24/7/365 online retailer environment.

There is a very similar question:
Performance Inserting and Updating Millions of rows into a table

This question is also relevant:
What is the fastest way to insert large numbers of rows?

Example one:

```
CREATE PROCEDURE [dbo].[udpstaging_page_import_fromFilter]
    @sourceDesc nvarchar(50)                -- e.g. 'Coremetrics'
    ,@feedDesc nvarchar(50)                 -- e.g. 'Daily Exports'
    ,@process_job_task_logID bigint
AS BEGIN

    SET NOCOUNT ON;
    BEGIN TRY

        -- truncate table prior to INSERT
        exec dbo.udpstaging_page_truncateTable;

        declare @source_feedID int;
        exec crm_admin.dbo.udpsource_feedID_select @sourceDesc
            ,@feedDesc
            ,@source_feedID = @source_feedID OUTPUT;

        -- drop temp tables
        if exists (select * from tempdb.dbo.sysobjects o where o.xtype in ('U') and o.id = object_id(N'tempdb..#pageImport'))
            drop table #pageImport;

        -- create temp tables
        create table #pageImport(
            pageImportID [bigint] identity(1,1) NOT NULL
            -- note: varbinary with no length specified defaults to varbinary(1)
            ,pageCode varbinary NOT NULL
        );

        insert into #pageImport(
            pageCode
        )
        select pageCode
        from Coremetrics.PageView
        group by pageCode;

        -- add indexes to temp table
        CREATE CLUSTERED INDEX IDX_pageImport_pageImportID ON #pageImport(pageImportID);
        CREATE INDEX IDX_pageImport_
```

Solution

> opinion based answers are not the most popular here, try to give evidence when possible.

Well, that is not entirely fair given that, ultimately, the "evidence" for what works best will come from your system ;-). Your hardware, your data, your system load, etc. will determine what works best. There are a lot of variables both in how you can approach things and in how your system behaves, so what works best in one system might not be so great in another.


> 1) How do I decide the best way to organize the size of the batches? For example, in Example two it is 5000.

This is mostly a matter of trial-and-error to see what works best. However, it is important to keep in mind that lock escalation typically occurs at 5000 locks. Depending on how your data is organized, 5000 changes could be 5000 row locks or a few page locks. This is on a per-object basis since the ordering of the rows can be different across different indexes. The point is, changes made to tables that need to be usable by other processes during these operations should try to avoid table locks (i.e. the result of lock escalation). But tables such as the staging table and temp tables would benefit from table locks since it is a single operation and there should be no contention, hence the TABLOCK hint when doing a "bulk" INSERT.
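The TABLOCK point above can be sketched as follows (the staging table and column names here are hypothetical, not from the posted procedure):

```
-- Single-writer load into a staging table: no concurrent readers or
-- writers, so taking the table lock up front avoids the overhead of
-- acquiring thousands of row/page locks and then escalating anyway.
INSERT INTO dbo.stagingPage WITH (TABLOCK) (
    pageCode
)
SELECT pageCode
FROM Coremetrics.PageView
GROUP BY pageCode;
```

By contrast, batches written to the live, concurrently used table should stay below the escalation threshold so other sessions keep working against row or page locks.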


> 2) Would it generally have more chance of improving performance if I BEGIN TRANSACTION and COMMIT TRANSACTION within the while loop? One transaction per batch.

Wrapping multiple DML operations into an explicit transaction greatly helps performance when those operations are row-by-row. From the code you posted here, you are already doing set-based operations so there is only minor timing benefit from combining anything into a transaction. Also, in your WHILE loops in both examples you are doing a single INSERT operation, which is its own transaction, hence adding a transaction inside of the WHILE loop gains nothing anyway. And, adding an explicit transaction around the WHILE loop would put the entire set into a single transaction, which again might help a little on the timing, but you would then also have a huge transaction that would increase chances for blocking as well as contribute to LOG file growth since this transaction would be active for longer.
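A minimal sketch of the batched WHILE loop pattern discussed above (the destination table name is hypothetical; each INSERT runs as its own auto-commit transaction, so no explicit BEGIN/COMMIT is needed inside the loop):

```
-- Copy from the staging temp table into the live table in batches,
-- keyed on the IDENTITY column so each batch is a contiguous range.
-- 4500 keeps each statement safely under the ~5000-lock escalation
-- threshold mentioned above.
DECLARE @batchSize bigint = 4500;
DECLARE @minID bigint = 1;
DECLARE @maxID bigint;

SELECT @maxID = MAX(pageImportID) FROM #pageImport;

WHILE @minID <= @maxID
BEGIN
    -- one auto-commit transaction per iteration
    INSERT INTO dbo.livePage (pageCode)
    SELECT pageCode
    FROM #pageImport
    WHERE pageImportID BETWEEN @minID AND (@minID + @batchSize - 1);

    SET @minID = @minID + @batchSize;
END;
```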


> 3) In case I would like to change the size of the batch, what could I monitor in order to decide whether I can increase the size of the batch, or whether I am causing I/O latency?

Monitor whether the process runs faster or slower. Try several different batch sizes and let each one run for several iterations of the process. Keep track of how long the process runs for each batch size and you will find what works best.

Along these lines, I would at least try reducing the million row batch size in Example 1.
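One low-tech way to keep track of those trial runs, sketched here with a hypothetical log table:

```
-- Record start/end times per run so batch sizes can be compared from
-- logged numbers rather than memory.
CREATE TABLE dbo.batchSizeTiming (
    runID int IDENTITY(1,1) PRIMARY KEY
    ,batchSize int NOT NULL
    ,startTime datetime2 NOT NULL
    ,endTime datetime2 NULL
);

DECLARE @runID int;

INSERT INTO dbo.batchSizeTiming (batchSize, startTime)
VALUES (4500, SYSDATETIME());
SET @runID = SCOPE_IDENTITY();

-- ... run the import with this batch size ...

UPDATE dbo.batchSizeTiming
SET endTime = SYSDATETIME()
WHERE runID = @runID;
```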

I would also convert the scalar UDF dbo.udfDerivePageName into an Inline TVF and incorporate that into the query (Example 1) using CROSS APPLY or OUTER APPLY. And, considering that both calls to the UDF pass in the same two parameters, you would just reference the returned field twice (once as pageName and once as pageDesc) rather than having two calls to the iTVF.
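A sketch of that scalar-UDF-to-iTVF conversion. The parameter names and the body of dbo.udfDerivePageName are assumptions (the posted procedure is truncated before the UDF calls), so the derivation logic is a placeholder:

```
-- Inline TVF version of the scalar UDF. The optimizer expands an
-- inline TVF into the calling query instead of invoking it per row.
CREATE FUNCTION dbo.itvfDerivePageName (
    @pageCode varbinary(8000)   -- assumed parameter
    ,@source_feedID int         -- assumed parameter
)
RETURNS TABLE
AS
RETURN
    -- placeholder: replace with the real logic from dbo.udfDerivePageName
    SELECT CONVERT(nvarchar(100), @pageCode, 2) AS derivedName;
GO

DECLARE @source_feedID int = 1; -- hypothetical value

-- One APPLY per row; the single returned column is referenced twice,
-- so the derivation runs once instead of in two scalar UDF calls.
SELECT pi.pageCode
    ,dn.derivedName AS pageName
    ,dn.derivedName AS pageDesc
FROM #pageImport pi
CROSS APPLY dbo.itvfDerivePageName(pi.pageCode, @source_feedID) dn;
```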

Another option to reduce contention on the destination table (if this is merely inserting new rows and not updating existing rows) is to use table partitioning. This would allow you to stage the data as you are currently doing, but then, rather than inserting that new data into the live table, you would SWITCH the new partition in, which is a rather quick operation. This won't help with the time or I/O it takes to stage the data in the first place, but it could eliminate the time and contention taken by "merging" the staged data into the live table. It's something to look into.
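The SWITCH step might look something like this (table, partition function, and boundary value are all hypothetical; the staging table must match the live table's structure, sit on the same filegroup as the target partition, and carry a CHECK constraint matching the partition boundary):

```
-- Metadata-only operation: the staged rows become a partition of the
-- live table without physically moving or re-inserting any data.
ALTER TABLE dbo.stagingPage
SWITCH TO dbo.livePage
    PARTITION $PARTITION.pfPageDate('2016-01-01');
```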

Context

StackExchange Database Administrators Q#119447, answer score: 4
