HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Database-Instance using half of available CPU cores

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
availablecoresdatabaseinstanceusingcpuhalf

Problem

I'm looking at the following scenario:

  • MSSQL Server 2019 Standard running on Windows Server 2016.



  • sys.dm_os_schedulers shows 12 available logical CPU cores.



  • Those 12 cores are split between 2 sockets.



  • Even though they're sitting on 2 sockets, they seem to be on the same NUMA node (the server is virtualized by VMWare) (coreinfo output):



*-----------  Physical Processor 0
-*----------  Physical Processor 1
--*---------  Physical Processor 2
---*--------  Physical Processor 3
----*-------  Physical Processor 4
-----*------  Physical Processor 5
------*-----  Physical Processor 6
-------*----  Physical Processor 7
--------*---  Physical Processor 8
---------*--  Physical Processor 9
----------*-  Physical Processor 10
-----------*  Physical Processor 11

Logical Processor to Socket Map:
******------  Socket 0
------******  Socket 1

Logical Processor to NUMA Node Map:
************  NUMA Node 0


  • There are no affinities configured.



  • Server is hosting multiple instances.



  • MAXDOP is set to 6 for all instances.



But still a symptom that I'm seeing is, that when instance 1 is running an intensive set of queries, the CPU is running on 50%, single cores are using 100% / very little CPU, alternating:

While this in itself is alright, at that time there is instance 2 which suddenly has terrible performance and is even losing connection to the database.

Does anyone have an idea or a hint where to look at to further narrow down this issue? Feel free to ask for any information you need. :-)

edits:
vSphere config for @ErikDarling:

errorlog output:

```
2023-09-05 23:15:51.95 Server Microsoft SQL Server 2019 (RTM-CU22) (KB5027702) - 15.0.4322.2 (X64)
Jul 27 2023 18:11:00
Copyright (C) 2019 Microsoft Corporation
Standard Edition (64-bit) on Windows Server 2016 Standard 10.0 (Build 14393: ) (Hypervisor)

2023-09-05 23:15:51.95 Server UTC adjustment: 2:00
2023-09-05 23:15:51.95 Server (c) Microsoft Corporation.
2023-09-05 23:15:

Solution

But still a symptom that I'm seeing is, that when instance 1 is running an intensive set of queries, the CPU is running on 50%, single cores are using 100% / very little CPU, alternating

While this in itself is alright, at that time there is instance 2 which suddenly has terrible performance and is even losing connection to the database.

The crux of the issue has to do with cpu scheduling and thread affinity in Windows and SQL Server.

  • There are two instances of SQL Server, neither knows about the other.



  • There are two sockets, SQL will set thread level affinity per socket (since no other trace flags are present).



  • SQL Server only presents threads to the OS to be scheduled, it does not control scheduling of them.



  • Windows schedules as it sees fit.



  • Virtualization skew due to cpu scheduling and processing abilities can come into play, such as cpu ready and co-stop in VMWare.



  • MAXDOP is set to 6, which is an entire socket. SQL Server will not typically span sockets or NUMA nodes to place parallel threads.



Without getting into the murky waters of parallel thread placement, it's a good assumption that it'll be placed on the set of threads local to the connection. If the connection originally came in on cpu 3, then it would be local to socket 0, which would set the thread placement to socket 0 affinitized threads (in this case 0 - 5). That's essentially where your 50% cpu usage comes from... SQL server is running the parallel query, the threads are set in SQL Server Instance 1 as available to run to the OS, the OS schedules them at some point to run.

This happens many times per second, any hits on a wait will cause the thread to not be scheduled as it'll be in an indefinite wait state according to the OS. Since these aren't the only threads to be scheduled as other items need to also run (such as Instance 2, services, etc.) and that may be on threads affinitized to the same set of processors. They may not get time to run or may not run to completion due to other factors (such as ISRs, DPCs, APCs, etc.).

If SQL Server Instance 1 runs an intense workload with many threads on the same cores that Instance 2 also wants, they both can't run at the same time. Additionally, the OS doesn't know anything about SQL, any possible timeouts, errors, or anything else, so it schedules threads as it sees fit. Assume Instance 1 runs a query that uses parallelism where the underlying implementation is a tight loop that does not internally yield to SQL OS (SQL Server implements a cooperative scheduling environment, vs Windows which is not). Assume that there are no other waits (disk, memory, network, etc.) on these threads. The threads may be scheduled for multiple runs on a cpu before threads from Instance 2 are run, again that's up to Windows. Assume instance 2 is running a thread that requires memory allocation, that thread may ask for an allocation from the Windows memory manager which in turn may need to wait for concurrency lock before continuing further. This would place that thread in a waiting state and be context switched off the cpu. Additionally assume that said thread had a timeout of 15 seconds. The timeout might be hit before the lock could be granted to that thread, which would only put the thread as alerted (according to Windows) and allow it to be scheduled which might be behind a bunch of other threads, such as the hidden schedulers in the instance, the other instance, incoming network packets, video rendering, disk completions, etc.

Is this a contrived example? Yep, for the fact of giving an easy to follow example.

This doesn't even bring in hypervisors, how they schedule, cpu scheduling drift allowed, etc., such as is possible in VMware. Nor does it bring into account any issues from the administrators of said virtualization platform who may be oversubscribing the host.

Example of Thread Affinity in SQL Server:

FROM sys.dm_os_threads  t
    inner join sys.dm_os_schedulers s
        on t.scheduler_address = s.scheduler_address


```
+----------------+--------------+----------------+--------------+-----------+
| parent_node_id | scheduler_id | status | os_thread_id | affinity |
+----------------+--------------+----------------+--------------+-----------+
| 0 | 0 | VISIBLE ONLINE | 6156 | 63 |
| 0 | 0 | VISIBLE ONLINE | 6160 | 63 |
| 0 | 0 | VISIBLE ONLINE | 2464 | 63 |
| 0 | 0 | VISIBLE ONLINE | 5396 | 63 |
| 0 | 0 | VISIBLE ONLINE | 2052 | 63 |
| 0 | 0 | VISIBLE ONLINE | 3468 | 63 |
| 0 | 0 | VISIBLE ONLINE | 1456 | 63 |
| 0 | 0 | VISIBLE ONLINE | 5744 | 63 |
| 0 | 0 | VISIBLE ONLINE | 1884 | 63 |
+----------------+--------------+-

Code Snippets

FROM sys.dm_os_threads  t
    inner join sys.dm_os_schedulers s
        on t.scheduler_address = s.scheduler_address
+----------------+--------------+----------------+--------------+-----------+
| parent_node_id | scheduler_id | status         | os_thread_id | affinity  |
+----------------+--------------+----------------+--------------+-----------+
| 0              | 0            | VISIBLE ONLINE | 6156         | 63        |
| 0              | 0            | VISIBLE ONLINE | 6160         | 63        |
| 0              | 0            | VISIBLE ONLINE | 2464         | 63        |
| 0              | 0            | VISIBLE ONLINE | 5396         | 63        |
| 0              | 0            | VISIBLE ONLINE | 2052         | 63        |
| 0              | 0            | VISIBLE ONLINE | 3468         | 63        |
| 0              | 0            | VISIBLE ONLINE | 1456         | 63        |
| 0              | 0            | VISIBLE ONLINE | 5744         | 63        |
| 0              | 0            | VISIBLE ONLINE | 1884         | 63        |
+----------------+--------------+----------------+--------------+-----------+

Context

StackExchange Database Administrators Q#334701, answer score: 10

Revisions (0)

No revisions yet.