HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMajor

Does CPU utilization affect the cost of foreign NUMA access?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
utilizationthenumaforeignaffectdoescpucostaccess

Problem

Scenario

Let's assume I have a SQL Server with 4 sockets with each 1 NUMA node. Each socket has 4 physical cores. There is 512 GB of memory total so each NUMA node has 128 GB of RAM.

A key table is loaded into the first NUMA node.
Question

Let's assume we have a lot of traffic reading from that table. If all physical cores of the socket that owns the NUMA node have 100 percent CPU utilization, does that negatively influence the cost of non-local NUMA access coming from other sockets? Or on the other hand is the cost of non-local NUMA access is irrespective of how busy that socket is?

I hope my question makes sense. Please let me know if it doesn't I will try to clarify.
Background

We had a database issue in our production server last week and some of our business processed appeared more impacted than others. We had queries with few logical reads taking more than 1 minute. We looked at overall CPU utilization which was around 60 percent. We did not look at socket specific CPU metrics. I/O metrics were average.

Solution

A hefty question :-)
I'll outline some of the factors involved. In any given context, these factors and others can vary and produce an interesting result.

Sorry I wasn't able to make this much shorter...

  • Accumuated CPU ms vs logical IO



  • SQL Server logical memory node alignment with physical NUMA nodes



  • Spinlock contention in query workspace memory allocation



  • Task assignment to schedulers



  • Relevant data placement in the buffer pool



  • Physical memory placement



  • Accumuated CPU ms vs logical IO



I use graphs of logical IO (or in perfmon terminology "buffer pool page lookups") against CPU utilization very often, in order to gauge cpu efficiency of workloads and look for spinlock prone cases.

But SQL Server accumulates CPU time with lots of activity other than page lookups and spinlocks:

  • Plans are compiled and re-compiled.



  • CLR code is executed.



  • Functions are performed.



A lot of other activities will chew up significant cpu time without being reflected in the page lookups.

In the workloads I observe, chief among these "non logical IO intensive but CPU-gobbling" activities is sorting/hashing activity.

It stands to reason: consider a contrived example of two queries against a hashtable with no nonclustered indexes. The two queries have identical resultsets, but one of the resultsets is completely unordered and the second resultset is ordered by more than one of the selected columns. The second query would be expected to consume more CPU time, even though it would reference the same number of pages in the buffer pool.

More about workspace memory, and how much of granted workspace has been used, in these posts:

-
http://sql-sasquatch.blogspot.com/2015/08/sql-server-grantedreservedstolen_4.html

-
http://sql-sasquatch.blogspot.com/2015/08/sql-server-workspace-memory-with-twist.html

-
http://sql-sasquatch.blogspot.com/2015/03/resource-governor-to-restrict-max-query.html

  • SQL Server logical memory node alignment with physical NUMA nodes



SQL Server (since incorporating its NUMA-aware strategies) by default creates a SQLOS memory node for each NUMA node on the server. As memory allocations grow, each allocation is controlled by one of the SQLOS memory nodes.

Ideally, the SQLOS memory nodes are completely aligned with the physical NUMA nodes. That is to say, each SQLOS memory node contains memory from a single NUMA node, with no other SQLOS memory node also containing memory from that same NUMA node.

However, that ideal situation is not always the case.

The following CSS SQL Server Engineers blog post (also included in Kin's response) details behavior which can lead to persisting cross-NUMA node memory allocations for the SQLOS memory nodes. When this happens, the performance impact can be devastating.

  • Link



There have been a few fixes for the particularly painful case of persistent cross-NUMA node reference. Probably others in addition to these two, as well:

-
FIX: Performance problems occur in NUMA environments during foreign page processing in SQL Server 2012 or SQL Server 2014

-
FIX: SQL Server performance issues in NUMA environments

  • Spinlock contention during allocation of workspace memory



This is where it starts to get fun. I've already described that sort and hash work in workspace memory consumes CPU but is not reflected in the bpool lookup numbers.

Spinlock contention is another layer to this particular fun. When memory is stolen from the buffer pool and allocated for use against a query memory grant, memory access is serialized with a spinlock. By default, this takes place with a resource partitioned at the NUMA node level. So every query on the same NUMA node using workspace memory can potentially experience spinlock contention when stealing memory against grants. Very important to note: this isn't "one time per query" contention risk, as it would be if the point of contention were at the time of the actual grant. Rather, its when memory is stolen against the grant - so a query with a very large memory grant will have many opportunities for spinlock contention if it uses most of its grant.

Trace flag 8048 does a great job of relieving this contention by further partitioning the resource at the core level.

Microsoft says "consider trace flag 8048 if 8 or more cores per socket". But... it's not really how many cores per socket (as long as there are multiple), but rather how many opportunities for contention in the work being done on a single NUMA node.

On the glued AMD processors (12 cores per socket, 2 NUMA nodes per socket) there were 6 cores per NUMA node. I saw a system with 4 of those CPUs (so eight NUMA nodes, 6 cores each) that was jammed up in spinlock convoy until trace flag 8048 was enabled.

I've seen this spinlock contention drag down performance on VMs as small as 4 vCPUs. Trace flag 8048 did what it was supposed to when enabled on those systems.

Considering that there are still some 4 core frequency optimized CPUs out there, with

Context

StackExchange Database Administrators Q#111462, answer score: 20

Revisions (0)

No revisions yet.