patternMajor
SQL Server has encountered occurences of I/O requests taking longer than 15 seconds
Viewed 0 times
sqllongerthansecondsencounteredoccurenceshasrequestsservertaking
Problem
On Production SQL Server, we have following config:
3 Dell PowerEdge R630 servers, combined into Availability Group
All 3 are connected to a single Dell SAN storage unit which is a RAID array
From time to time, on PRIMARY we are seeing messages similar to below:
SQL Server has encountered 11 occurrence(s) of I/O requests taking longer
than 15 seconds to complete on file [F:\Data\MyDatabase.mdf] in database id 8.
The OS file handle is 0x0000000000001FBC.
The offset of the latest long I/O is: 0x000004295d0000.
The duration of the long I/O is: 37397 ms.
We are novice in performance troubleshooting
What are the most common ways or best practices in troubleshooting this particular issue related to storage ?
What performance counters, tools, monitors, apps, etc. must be used to narrow down to the root cause of such messages ?
Might be there is a Extended Events that can help, or some kind of audit / logging ?
UPDATE: added my own Answer (see below) which explains what we did to fix the issue
3 Dell PowerEdge R630 servers, combined into Availability Group
All 3 are connected to a single Dell SAN storage unit which is a RAID array
From time to time, on PRIMARY we are seeing messages similar to below:
SQL Server has encountered 11 occurrence(s) of I/O requests taking longer
than 15 seconds to complete on file [F:\Data\MyDatabase.mdf] in database id 8.
The OS file handle is 0x0000000000001FBC.
The offset of the latest long I/O is: 0x000004295d0000.
The duration of the long I/O is: 37397 ms.
We are novice in performance troubleshooting
What are the most common ways or best practices in troubleshooting this particular issue related to storage ?
What performance counters, tools, monitors, apps, etc. must be used to narrow down to the root cause of such messages ?
Might be there is a Extended Events that can help, or some kind of audit / logging ?
UPDATE: added my own Answer (see below) which explains what we did to fix the issue
Solution
This is far less often a disk issue, and far more often a networking issue. You know, the N in SAN?
If you go to your SAN team and start talking about the disks being slow, they're gonna show you a fancy graph with 0 millisecond latency on it and then point a stapler at you.
Instead, ask them about the network path to the SAN. Get speeds, if it's multipathed, etc. Get numbers from them about the speeds you should be seeing. Ask if they have benchmarks from when the servers were set up.
Then you can use Crystal Disk Mark or diskpd to validate those speeds. If they don't line up, again, it's most likely the networking.
You should also search your error log for messages that contain "FlushCache" and "saturation", because those can also be signs of network contention.
One thing you can do to avoid those things as a DBA is make sure that your maintenance and any other data-heavy tasks (like ETL) aren't going on at the same time. That can definitely put a lot of pressure on storage networking.
You may also want to check the answers here for more suggestions: Slow checkpoint and 15 second I/O warnings on flash storage
I blogged about a similar topic here: From The Server To The SAN
If you go to your SAN team and start talking about the disks being slow, they're gonna show you a fancy graph with 0 millisecond latency on it and then point a stapler at you.
Instead, ask them about the network path to the SAN. Get speeds, if it's multipathed, etc. Get numbers from them about the speeds you should be seeing. Ask if they have benchmarks from when the servers were set up.
Then you can use Crystal Disk Mark or diskpd to validate those speeds. If they don't line up, again, it's most likely the networking.
You should also search your error log for messages that contain "FlushCache" and "saturation", because those can also be signs of network contention.
One thing you can do to avoid those things as a DBA is make sure that your maintenance and any other data-heavy tasks (like ETL) aren't going on at the same time. That can definitely put a lot of pressure on storage networking.
You may also want to check the answers here for more suggestions: Slow checkpoint and 15 second I/O warnings on flash storage
I blogged about a similar topic here: From The Server To The SAN
Context
StackExchange Database Administrators Q#240469, answer score: 32
Revisions (0)
No revisions yet.