HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

~2 Hours after a large insert SQL Server gets "I/O requests taking longer than 15 seconds to complete"

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
afterinserthourssqllongerthansecondscompletelargerequests

Problem

This problem has been haunting me for a while but I actually fixed the queries that caused the problems (after I finally found them) but i'm curious if it's an SQL Process that's the underlying problem our that our raidcontrollers are shit or what the problem truely is. So here's the scenario that's been bothering us.

@ 11:10~ and 01:10~ the server would give these IO errors, even through Remote desktop the servers would fail, management studio would give timeouts, everything would point towards imminent failure of the cluster.

(The servers are identical, we run Windows Server 2012 standard, clustered in an AlwaysOn running SQL Server 2012 Enterprise edition).

Around 5-15 minutes later, the servers would come back to live (without anything in the eventlogs except for the Event ID 833 which is the "well known": SQL Server has encountered 16870 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\XXXXXXXX.mdf] in database [XXXXX] (10). The OS file handle is 0x0000000000001284. The offset of the latest long I/O is: 0x0000041fb3c000).

Now I was running profilers to find out which queries were running, doing anything I could to figure out what would cause the servers to just freeze up and refuse to service properly. The event log aims towards a specific Database to I searched in that direction and yesterday I found a table that looked odd to me.

The specific table was 11 gigs large, had indexes worth 23 gigs (haven't looked into those yet but that seemed wrong to me) and had 80 million records. Considering the table should hold around 2 million and be growing relatively slowly this sparked my interested and I found three stored procedures which fill this table.

All three storedprocedures had flaws which made them insert not a few hundred records but over 500.000 each time they ran. (These are now fixed so the issue doesn't present itself but I feel there is an underlying problem here).

These stored procedures would run @ 09:00 and

Solution

BOL refers to 833 event ID as :

This problem can be caused system performance issues, hardware errors, firmware errors, device driver problems, or filter driver intervention in the IO process.

What it means is that

  • SQL server is doing more I/O's that what the disk subsystem can handle.



  • There could be some rogue process running on the system that is saturating the disks with all the I/O requests.



Below is the approach I would follow:

Troubleshoot using Wait Statistics - DMV's

WITH [Waits] AS
    (SELECT
        [wait_type],
        [wait_time_ms] / 1000.0 AS [WaitS],
        ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
        [signal_wait_time_ms] / 1000.0 AS [SignalS],
        [waiting_tasks_count] AS [WaitCount],
        100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
        ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
    FROM sys.dm_os_wait_stats
    WHERE [wait_type] NOT IN (
        N'CLR_SEMAPHORE',    N'LAZYWRITER_SLEEP',
        N'RESOURCE_QUEUE',   N'SQLTRACE_BUFFER_FLUSH',
        N'SLEEP_TASK',       N'SLEEP_SYSTEMTASK',
        N'WAITFOR',          N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
        N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',
        N'XE_TIMER_EVENT',   N'XE_DISPATCHER_JOIN',
        N'LOGMGR_QUEUE',     N'FT_IFTS_SCHEDULER_IDLE_WAIT',
        N'BROKER_TASK_STOP', N'CLR_MANUAL_EVENT',
        N'CLR_AUTO_EVENT',   N'DISPATCHER_QUEUE_SEMAPHORE',
        N'TRACEWRITE',       N'XE_DISPATCHER_WAIT',
        N'BROKER_TO_FLUSH',  N'BROKER_EVENTHANDLER',
        N'FT_IFTSHC_MUTEX',  N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        N'DIRTY_PAGE_POLL',  N'SP_SERVER_DIAGNOSTICS_SLEEP')
    )
SELECT
    [W1].[wait_type] AS [WaitType],
    CAST ([W1].[WaitS] AS DECIMAL(14, 2)) AS [Wait_S],
    CAST ([W1].[ResourceS] AS DECIMAL(14, 2)) AS [Resource_S],
    CAST ([W1].[SignalS] AS DECIMAL(14, 2)) AS [Signal_S],
    [W1].[WaitCount] AS [WaitCount],
    CAST ([W1].[Percentage] AS DECIMAL(4, 2)) AS [Percentage],
    CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgWait_S],
    CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgRes_S],
    CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgSig_S]
FROM [Waits] AS [W1]
INNER JOIN [Waits] AS [W2]
    ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS],
    [W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage]
HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95; -- percentage threshold
GO


Look at the pending I/O's

SELECT database_id
    ,file_id
    ,io_stall
    ,io_pending_ms_ticks
    ,scheduler_address
FROM sys.dm_io_virtual_file_stats(NULL, NULL) t1
    ,sys.dm_io_pending_io_requests AS t2
WHERE t1.file_handle = t2.io_handle


Below are PERFMON counters that will help you:

  • Avg. Disk sec/Transfer => Time taken to perform the I/O operation



  • Disk Bytes /sec => Total reads and writes to disk per second in bytes.



  • Process:IO Data Bytes/Sec => Total reads and writes to disk per second in bytes by each process.



  • Buffer Manager: Page Read/sec + Page Writes/sec =>Total reads and writes to disk per second in bytes by SQL Server process.



Also, sometimes such errors can also occur when a different Power Management scheme is selected for the server. So better check your Windows Power Plan setting, and make sure you are using the High Performance Power Plan.

All three stored procedures had flaws which made them insert not a few hundred records but over 500.000 each time they ran. (These are now fixed so the issue doesn't present itself but I feel there is an underlying problem here).

If you are inserting a huge amount of data, then you need to refer to The Data Loading Performance Guide to cater your server for such operations e.g. enabling trace flag 610.

Note: Others have mentioned that Auto Update Stats might be causing it, which is also worth looking at, but I would primarily concentrate on finding out rouge process that is causing lots of Disk I/Os and checking the storage subsystem to see if all the divers, etc are up-to-date and is configured correctly.

Good reading links :

  • I/O requests taking longer than 15 seconds to complete on file



  • Troubleshooting SQL Server I/O requests taking longer than 15 seconds - I/O stalls & Disk latency



  • Event ID 833: I/O requests taking longer than 15 seconds

Code Snippets

WITH [Waits] AS
    (SELECT
        [wait_type],
        [wait_time_ms] / 1000.0 AS [WaitS],
        ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
        [signal_wait_time_ms] / 1000.0 AS [SignalS],
        [waiting_tasks_count] AS [WaitCount],
        100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
        ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
    FROM sys.dm_os_wait_stats
    WHERE [wait_type] NOT IN (
        N'CLR_SEMAPHORE',    N'LAZYWRITER_SLEEP',
        N'RESOURCE_QUEUE',   N'SQLTRACE_BUFFER_FLUSH',
        N'SLEEP_TASK',       N'SLEEP_SYSTEMTASK',
        N'WAITFOR',          N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
        N'CHECKPOINT_QUEUE', N'REQUEST_FOR_DEADLOCK_SEARCH',
        N'XE_TIMER_EVENT',   N'XE_DISPATCHER_JOIN',
        N'LOGMGR_QUEUE',     N'FT_IFTS_SCHEDULER_IDLE_WAIT',
        N'BROKER_TASK_STOP', N'CLR_MANUAL_EVENT',
        N'CLR_AUTO_EVENT',   N'DISPATCHER_QUEUE_SEMAPHORE',
        N'TRACEWRITE',       N'XE_DISPATCHER_WAIT',
        N'BROKER_TO_FLUSH',  N'BROKER_EVENTHANDLER',
        N'FT_IFTSHC_MUTEX',  N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        N'DIRTY_PAGE_POLL',  N'SP_SERVER_DIAGNOSTICS_SLEEP')
    )
SELECT
    [W1].[wait_type] AS [WaitType],
    CAST ([W1].[WaitS] AS DECIMAL(14, 2)) AS [Wait_S],
    CAST ([W1].[ResourceS] AS DECIMAL(14, 2)) AS [Resource_S],
    CAST ([W1].[SignalS] AS DECIMAL(14, 2)) AS [Signal_S],
    [W1].[WaitCount] AS [WaitCount],
    CAST ([W1].[Percentage] AS DECIMAL(4, 2)) AS [Percentage],
    CAST (([W1].[WaitS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgWait_S],
    CAST (([W1].[ResourceS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgRes_S],
    CAST (([W1].[SignalS] / [W1].[WaitCount]) AS DECIMAL (14, 4)) AS [AvgSig_S]
FROM [Waits] AS [W1]
INNER JOIN [Waits] AS [W2]
    ON [W2].[RowNum] <= [W1].[RowNum]
GROUP BY [W1].[RowNum], [W1].[wait_type], [W1].[WaitS],
    [W1].[ResourceS], [W1].[SignalS], [W1].[WaitCount], [W1].[Percentage]
HAVING SUM ([W2].[Percentage]) - [W1].[Percentage] < 95; -- percentage threshold
GO
SELECT database_id
    ,file_id
    ,io_stall
    ,io_pending_ms_ticks
    ,scheduler_address
FROM sys.dm_io_virtual_file_stats(NULL, NULL) t1
    ,sys.dm_io_pending_io_requests AS t2
WHERE t1.file_handle = t2.io_handle

Context

StackExchange Database Administrators Q#51678, answer score: 3

Revisions (0)

No revisions yet.