patternsqlModerate
How does one investigate the performance of a BULK INSERT statement?
Viewed 0 times
theinsertstatementinvestigatebulkoneperformancedoeshow
Problem
I am mainly a .NET developer using Entity Framework ORM. However, because I don't want to fail in using the ORM, I am trying to understand what happens within the data layer (database). Basically, during the development I start the profiler and check what some parts of code generate in terms of queries.
If I spot something utterly complicated (ORM can generate awful queries even from rather simple LINQ statements, if not carefully written) and/or heavy (duration, CPU, page reads), I take it in SSMS and check its execution plan.
It works fine for my level of database knowledge. However, BULK INSERT seems to be a special creature, as it does not seem to produce a SHOWPLAN.
I will try to illustrate a very simple example:
Table definition
Note: no other indexes are defined on the table
The bulk insert
(what I catch in profiler, one batch only)
Metrics
For my application, that's ok, although the reads seems rather large (I know very little about SQL Server internals, so I comparing to the 8K page size and the small record information I have)
Question: how can I i
If I spot something utterly complicated (ORM can generate awful queries even from rather simple LINQ statements, if not carefully written) and/or heavy (duration, CPU, page reads), I take it in SSMS and check its execution plan.
It works fine for my level of database knowledge. However, BULK INSERT seems to be a special creature, as it does not seem to produce a SHOWPLAN.
I will try to illustrate a very simple example:
Table definition
CREATE TABLE dbo.ImportingSystemFileLoadInfo
(
ImportingSystemFileLoadInfoId INT NOT NULL IDENTITY(1, 1) CONSTRAINT PK_ImportingSystemFileLoadInfo PRIMARY KEY CLUSTERED,
EnvironmentId INT NOT NULL CONSTRAINT FK_ImportingSystemFileLoadInfo REFERENCES dbo.Environment,
ImportingSystemId INT NOT NULL CONSTRAINT FK_ImportingSystemFileLoadInfo_ImportingSystem REFERENCES dbo.ImportingSystem,
FileName NVARCHAR(64) NOT NULL,
FileImportTime DATETIME2 NOT NULL,
CONSTRAINT UQ_ImportingSystemImportInfo_EnvXIs_TableName UNIQUE (EnvironmentId, ImportingSystemId, FileName, FileImportTime)
)Note: no other indexes are defined on the table
The bulk insert
(what I catch in profiler, one batch only)
insert bulk [dbo].[ImportingSystemFileLoadInfo] ([EnvironmentId] Int, [ImportingSystemId] Int, [FileName] NVarChar(64) COLLATE Latin1_General_CI_AS, [FileImportTime] DateTime2(7))Metrics
- 695 items inserted
- CPU = 31
- Reads = 4271
- Writes = 24
- Duration = 154
- Total table count = 11500
For my application, that's ok, although the reads seems rather large (I know very little about SQL Server internals, so I comparing to the 8K page size and the small record information I have)
Question: how can I i
Solution
As far as I can tell you can optimize a bulk insert in a very similar way that you'd optimize a regular insert. Typically, a query plan for a simple insert isn't very informative so don't worry about not having the plan. I'll go over a few ways of optimizing an insert but most of them probably don't apply for the insert you specified in the question. However, they could be helpful if in the future you need to load larger amounts of data.
SQL Server will often sort data before inserting it into a table with a clustered index. For some tables and applications you can improve performance by sorting the data in the flat file and letting SQL Server know that the data is sorted through the
ORDER ( { column [ ASC | DESC ] } [ ,... n ] )
Specifies how the data in the data file is sorted. Bulk import performance is improved if the data being imported is sorted according to the clustered index on the table, if any.
Since you are using an
If you are guaranteed to have only one session inserting data into your table you can specify the
If
Another option is to enable trace flag 715.
Sometimes you will be able to tune inserts by changing the batch size.
ROWS_PER_BATCH = rows_per_batch
Indicates the approximate number of rows of data in the data file.
By default, all the data in the data file is sent to the server as a single transaction, and the number of rows in the batch is unknown to the query optimizer. If you specify ROWS_PER_BATCH (with a value > 0) the server uses this value to optimize the bulk-import operation. The value specified for ROWS_PER_BATCH should approximately the same as the actual number of rows. For information about performance considerations, see "Remarks," later in this topic.
Here is the quote from later in the article:
If the number of pages to be flushed in a single batch exceeds an internal threshold, a full scan of the buffer pool might occur to identify which pages to flush when the batch commits. This full scan can hurt bulk-import performance. A likely case of exceeding the internal threshold occurs when a large buffer pool is combined with a slow I/O subsystem. To avoid buffer overflows on large machines, either do not use the TABLOCK hint (which will remove the bulk optimizations) or use a smaller batch size (which preserves the bulk optimizations).
Because computers vary, we recommend that you test various batch sizes with your data load to find out what works best for you.
Personally I would just insert all 695 rows in a single batch. Tuning the batch size can make a big difference when inserting lots of data though.
I don't know anything about your data model or requirements, but don't fall into the trap of adding an
If you're loading a large amount of data into a table compared to what you already have then it may be faster to disable indexes or constraints before the load and to enable them after the load. For large amounts of data it is usually more inefficient for SQL Server to build an index all at once instead of as data is loaded into the table. It looks like you inserted 695 rows into a table with 11500 rows, so I would not recommend this technique.
Trace Flag 610 allows minimal logging in some additional scenarios. For your table with an
I/O Impact of Minimal Logging Under Trace Flag 610
When you commit a bulk load transaction that was
- Insert data in clustering key order
SQL Server will often sort data before inserting it into a table with a clustered index. For some tables and applications you can improve performance by sorting the data in the flat file and letting SQL Server know that the data is sorted through the
ORDER argument of BULK INSERT:ORDER ( { column [ ASC | DESC ] } [ ,... n ] )
Specifies how the data in the data file is sorted. Bulk import performance is improved if the data being imported is sorted according to the clustered index on the table, if any.
Since you are using an
IDENTITY column as the clustered key you don't need to worry about this.- Use
TABLOCKif possible
If you are guaranteed to have only one session inserting data into your table you can specify the
TABLOCK argument for BULK INSERT. This can reduce lock contention and can lead to minimal logging in some scenarios. However, you are inserting into a table with a clustered index that already contains data so you will not get minimal logging without trace flag 610 which is mentioned later in this answer.If
TABLOCK is not possible, because you can't change the code, not all hope is lost. Consider using sp_table_option:EXEC [sys].[sp_tableoption]
@TableNamePattern = N'dbo.BulkLoadTable' ,
@OptionName = 'table lock on bulk load' ,
@OptionValue = 'ON'Another option is to enable trace flag 715.
- Use an appropriate batch size
Sometimes you will be able to tune inserts by changing the batch size.
ROWS_PER_BATCH = rows_per_batch
Indicates the approximate number of rows of data in the data file.
By default, all the data in the data file is sent to the server as a single transaction, and the number of rows in the batch is unknown to the query optimizer. If you specify ROWS_PER_BATCH (with a value > 0) the server uses this value to optimize the bulk-import operation. The value specified for ROWS_PER_BATCH should approximately the same as the actual number of rows. For information about performance considerations, see "Remarks," later in this topic.
Here is the quote from later in the article:
If the number of pages to be flushed in a single batch exceeds an internal threshold, a full scan of the buffer pool might occur to identify which pages to flush when the batch commits. This full scan can hurt bulk-import performance. A likely case of exceeding the internal threshold occurs when a large buffer pool is combined with a slow I/O subsystem. To avoid buffer overflows on large machines, either do not use the TABLOCK hint (which will remove the bulk optimizations) or use a smaller batch size (which preserves the bulk optimizations).
Because computers vary, we recommend that you test various batch sizes with your data load to find out what works best for you.
Personally I would just insert all 695 rows in a single batch. Tuning the batch size can make a big difference when inserting lots of data though.
- Make sure that you need the
IDENTITYcolumn
I don't know anything about your data model or requirements, but don't fall into the trap of adding an
IDENTITY column to every table. Aaron Bertrand has an article about this called Bad habits to kick : putting an IDENTITY column on every table. To be clear, I'm not saying that you should remove the IDENTITY column from this table. However, if you determine that the IDENTITY column is not necessary and remove it that could improve insert performance.- Disable indexes or constraints
If you're loading a large amount of data into a table compared to what you already have then it may be faster to disable indexes or constraints before the load and to enable them after the load. For large amounts of data it is usually more inefficient for SQL Server to build an index all at once instead of as data is loaded into the table. It looks like you inserted 695 rows into a table with 11500 rows, so I would not recommend this technique.
- Consider TF 610
Trace Flag 610 allows minimal logging in some additional scenarios. For your table with an
IDENTITY clustered key, you would get minimal logging for any new data pages as long as your recovery model is simple or bulk-logged. I believe this feature is not on by default because it may degrade performance on some systems. You would need to test carefully before enabling this trace flag. The recommended Microsoft reference still appears to be The Data Loading Performance GuideI/O Impact of Minimal Logging Under Trace Flag 610
When you commit a bulk load transaction that was
Code Snippets
EXEC [sys].[sp_tableoption]
@TableNamePattern = N'dbo.BulkLoadTable' ,
@OptionName = 'table lock on bulk load' ,
@OptionValue = 'ON'DROP TABLE IF EXISTS X_TABLE;
CREATE TABLE X_TABLE (
VAL VARCHAR(1000) NOT NULL
);
SET STATISTICS IO, TIME ON;
INSERT INTO X_TABLE WITH (TABLOCK)
SELECT REPLICATE('Z', 1000)
FROM dbo.GetNums(10000); -- generate 10000 rowsContext
StackExchange Database Administrators Q#165966, answer score: 17
Revisions (0)
No revisions yet.