I/O requests taking longer than 15 seconds
Problem
Usually our weekly full backups finish in about 35 minutes, with daily diff backups finishing in ~5 minutes. Since Tuesday the dailies have taken almost 4 hours to complete, way more than should be required. Coincidentally, this started happening right after we got a new SAN/disk config.
Note that the server is running in production and we have no overall issues, it's running smoothly - except for the IO issue that's primarily manifested itself in the backup performance.
Looking at dm_exec_requests during the backup, the backup is constantly waiting on ASYNC_IO_COMPLETION. Aha, so we have disk contention!
However, neither the MDF (logs are stored on local disk) nor backup drive have any activity (IOPS ~= 0 - we have plenty of memory). Disk queue length ~= 0 as well. CPU hovers around 2-3%, no issue there either.
The SAN is a Dell MD3220i, the LUN consisting of 6x10k SAS drives. The server is connected to the SAN through two physical paths, each going through a separate switch with redundant connections to the SAN - a total of four paths, two of them being active at any time. I can verify that both connections are active through task manager - splitting the load perfectly evenly. Both connections are running 1G full duplex.
We used to use jumbo frames, but I've disabled them to rule out any issues here - no change. We have another server (same OS+config, 2008 R2) that is connected to other LUNs, and it shows no issues. It is however not running SQL Server, but just sharing CIFS on top of them. However, one of its LUNs' preferred paths is on the same SAN controller as the troublesome LUNs - so I've ruled that out as well.
Running a couple of SQLIO tests (10G test file) seems to indicate that IO is decent, despite the issues:
```
sqlio -kR -t8 -o8 -s30 -frandom -b8 -BN -LS -Fparam.txt
IOs/sec: 3582.20
MBs/sec: 27.98
Min_Latency(ms): 0
Avg_Latency(ms): 3
Max_Latency(ms): 98
histogram:
ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
```
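As a quick sanity check on the SQLIO numbers above, the reported throughput should simply be IOPS times block size. The sketch below redoes that arithmetic and compares it with the raw line rate of a single 1 Gbit/s path. The function names are made up for illustration, and the link ceiling ignores Ethernet/IP/iSCSI overhead, so treat the percentage as a rough upper bound rather than a measurement.

```python
def throughput_mib_per_s(iops: float, block_kib: float) -> float:
    """Throughput implied by an IOPS figure at a fixed block size (KiB)."""
    return iops * block_kib / 1024.0


def link_ceiling_mib_per_s(gbit_per_s: float) -> float:
    """Raw line rate in MiB/s, ignoring all protocol overhead (assumption)."""
    return gbit_per_s * 1e9 / 8 / (1024 * 1024)


# Figures taken from the SQLIO run: -b8 means 8 KB blocks.
iops = 3582.20
block_kib = 8

mib_s = throughput_mib_per_s(iops, block_kib)
ceiling = link_ceiling_mib_per_s(1.0)

print(f"implied throughput: {mib_s:.2f} MiB/s")
print(f"share of one 1 Gbit/s path: {mib_s / ceiling:.0%}")
```

The implied figure agrees with the reported `MBs/sec: 27.98` to within rounding, so the test is internally consistent. It also shows that a random 8 KB read test sits well below what one path can carry, which means it says little about how the links behave under the large sequential transfers a backup generates.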
Solution
> Note that all servers use the same NICs - Broadcom 5709Cs with up-to-date drivers. The servers themselves are Dell R610's.
Kyle Brandt has an opinion on Broadcom network cards which echoes my own (repeated) experience.
Broadcom, Die Mutha
My problems have always been associated with TCP Offload features and in 99% of cases disabling or switching to a-n-other network card has resolved the symptoms. One client that (as in your case) uses Dell servers, always orders separate Intel NICs and disables the on-board Broadcom cards on build.
As described in this MSDN blog post, I would start with disabling in the OS with:
netsh int ip set chimney DISABLED
IIRC it may be necessary to disable the features at the card driver level in some circumstances, it certainly won't hurt to do so.
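On Windows Server 2008 R2 the chimney setting also lives under `netsh int tcp`: `netsh int tcp show global` reports it and `netsh int tcp set global chimney=disabled` changes it. If you want to confirm the state on several servers, a small parser along these lines might help. This is only a sketch: the helper names are hypothetical, and the sample text is an illustrative excerpt, not output captured from the servers in the question.

```python
def parse_tcp_globals(text: str) -> dict:
    """Turn 'Name : value' lines from `netsh int tcp show global` into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            settings[key.strip()] = value.strip()
    return settings


def chimney_disabled(text: str) -> bool:
    """True only if the Chimney Offload State line reads 'disabled'."""
    state = parse_tcp_globals(text).get("Chimney Offload State", "")
    return state.lower() == "disabled"


# Illustrative excerpt (assumption), shaped like the command's output.
SAMPLE = """\
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : automatic
NetDMA State                        : enabled
"""

print(chimney_disabled(SAMPLE))
```

Feeding it the real command output (for example via `subprocess.run(["netsh", "int", "tcp", "show", "global"], capture_output=True, text=True).stdout` on the server itself) would tell you whether the OS-level change took effect. It says nothing about offload settings in the Broadcom driver, which still need checking separately.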
Code Snippets
netsh int ip set chimney DISABLED
Context
StackExchange Database Administrators Q#10950, answer score: 16