HiveBrain v1.2.0
Get Started
← Back to all entries
patternMajor

SQL Server has encountered occurences of I/O requests taking longer than 15 seconds

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
sqllongerthansecondsencounteredoccurenceshasrequestsservertaking

Problem

On Production SQL Server, we have following config:

3 Dell PowerEdge R630 servers, combined into Availability Group

All 3 are connected to a single Dell SAN storage unit which is a RAID array

From time to time, on PRIMARY we are seeing messages similar to below:

SQL Server has encountered 11 occurrence(s) of I/O requests taking longer
than 15 seconds to complete on file [F:\Data\MyDatabase.mdf] in database id 8.

The OS file handle is 0x0000000000001FBC.

The offset of the latest long I/O is: 0x000004295d0000.

The duration of the long I/O is: 37397 ms.

We are novice in performance troubleshooting

What are the most common ways or best practices in troubleshooting this particular issue related to storage ?

What performance counters, tools, monitors, apps, etc. must be used to narrow down to the root cause of such messages ?

Might be there is a Extended Events that can help, or some kind of audit / logging ?

UPDATE: added my own Answer (see below) which explains what we did to fix the issue

Solution

This is far less often a disk issue, and far more often a networking issue. You know, the N in SAN?

If you go to your SAN team and start talking about the disks being slow, they're gonna show you a fancy graph with 0 millisecond latency on it and then point a stapler at you.

Instead, ask them about the network path to the SAN. Get speeds, if it's multipathed, etc. Get numbers from them about the speeds you should be seeing. Ask if they have benchmarks from when the servers were set up.

Then you can use Crystal Disk Mark or diskpd to validate those speeds. If they don't line up, again, it's most likely the networking.

You should also search your error log for messages that contain "FlushCache" and "saturation", because those can also be signs of network contention.

One thing you can do to avoid those things as a DBA is make sure that your maintenance and any other data-heavy tasks (like ETL) aren't going on at the same time. That can definitely put a lot of pressure on storage networking.

You may also want to check the answers here for more suggestions: Slow checkpoint and 15 second I/O warnings on flash storage

I blogged about a similar topic here: From The Server To The SAN

Context

StackExchange Database Administrators Q#240469, answer score: 32

Revisions (0)

No revisions yet.