HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

why is io_stall_writes_ms so much higher for tempdb?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
whymuchtempdbforhigherio_stall_writes_ms

Problem

We have the user and system data files on the same disk drive. The ( io_stall_write_ms / ( 1.0 + num_of_writes ) ) is below 2 for the user files but the tempdb files are typically over 400. I see that on a few servers and I'm curious if there is a reason it takes longer to write to tempdb than a regular database data file.

SELECT DISTINCT UPPER(LEFT(mf.physical_name, 1)) AS Directory,
( io_stall_write_ms / ( 1.0 + num_of_writes ) ) as result, 
io_stall_write_ms, num_of_writes, 
fs.database_id, 
fs.[file_id]
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
INNER JOIN sys.master_files AS mf ON fs.database_id = mf.database_id
AND fs.[file_id] = mf.[file_id]


Thank you,

Solution

Short Answer: Seeing higher IO stalls may or may not be a problem in an of itself. You need to look at more information to suss out if it you have an issue. It does seem a bit high, yes, but are you suffering? If so, it is probably because either your IO system is not handling the load right (because it can't, because you have everything on one drive or some other reason) or you are doing too much in TempDB (changing the first problem - the IO performance - is probably an easier and more efficient fix, but first determine if you have a problem)

The longer discussion/answer:

There are two questions at play here -

1.) What do I do when I see high IO Stalls?

First off, "high" is in the eye of the beholder. If you were to ask 10 DBAs what "too high" is for IO stalls you'd probably get 2-3 different answers with numbers in them, 5-6 "It depends" answers and one blank stare. My assumption is an average of 400ms is potentially too high here, especially when the other DBs are 2ms or lower for the average stall time.

Regardless of which database is seeing the high stalls you should approach it the same way. An IO stall is what it sounds like... An IO request taking longer than expected.. Stalling. These happen. They happen all the time in a system with resources being shared and finite resources (really all of our systems). They become an issue when the stalls become performance problems or lead to them. So I trust that you are looking here as a proactive part of monitoring or because you were experiencing performance issues that you are troubleshooting. We also don't want to get lost in just IO stalls. We are looking at a piece of the puzzle and not the big picture. It can be troublesome to just look at wait stats or file stats since SQL was last restarted because you are looking at all time and some maintenance window or heavy load window could skew counters. So make sure you look at the full picture.

But when I suspect I have a disk performance issue or see something off in a query like this, I normally follow a process that looks like:

  • Look at the wait statistics on the server. @swasheck shared a great link as a comment in a below answer. This takes you to Paul Randal's post on looking at and analyzing wait statistics in SQL Server. Go there. What kind of waits are you seeing? Do you see waits related to IO performance (PAGEIOLATCH_*, IO_COMPLETION, WRITELOG, etc. ?). If you do this is another indication that you have some IO related performance issues, just like the IO stalls. But it gives you another form of agreement here.



  • Look at the IO performance. In particular, look inside of perfmon at the Physical Disk:Avg Disk Sec/Read and Avg Sec Disk Sec/Write counters. These measure your latency. Watch these counters over a period of time saved to a performance log file. What did you see for averages? If you are seeing numbers over 0.020 seconds (20ms) this could be an issue. If you see numbers over 40-50ms avg or higher is a more firm indication of a problem. Also look at your spikes? How high do they go and how long do they last? If you see spikes into the hundreds of ms and they last for tens or scores of seconds or more and/or happen frequently you are more likely to have an issue with your IO performance for your workload.



  • Look at your IO setup. What is it? Local disks? SAN? Storage Array? What kind of throughout and IOPs should you see out of this? Is it sufficient for what you are trying to do? You may have undersized your IO for your workload. Don't just look at your physical spindles, RAID settings, etc. Look at your paths to your disks. Are you pushing everything through a single 1GB link that you are sharing with a lot of other traffic? Can you look at disk performance metrics from the storage's perspective.



(Note: for this wait stats analysis and perfmon analysis - look at various periods and type of usage. Do you have different usage statistics at night than you do during the day? Batch processing windows? Maintenance windows where you rebuild a lot of indexes? Look at these tools during each of these periods and understand what you are seeing for each)

Another IO performance consideration here -

  • You said system DBs and User DBs are shared. Is this production? If so, that isn't always the best scenario. Are you also sharing log file and data files on the same drives? That isn't the best scenario either. What else shares this storage? In a world where you are worrying about spindles and raid groups and disks and have to make decisions on who gets the best performing disks, I tend to (as a general rule of thumb.. which aren't great to have in the DB world but this one tends to hold true) go with my fastest and most dedicated to TempDB (more on that below), then the log files, then the data files. In a world where you have a big pile of disks on a device like a NetApp, Dell Equal Logic or EMC VNX, etc. you don't need to necessarily worry about separation as much - but you s

Context

StackExchange Database Administrators Q#30145, answer score: 18

Revisions (0)

No revisions yet.