HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

How to reduce huge CXPACKET & LATCH_EX (ACCESS_METHODS_DATASET_PARENT) wait times?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
waithugereducetimeshowcxpacketaccess_methods_dataset_parentlatch_ex

Problem

Problem

We have been experiencing high levels of user disruption due to SQL timeouts accross our systems since the beginning of the year.

The SQL-Server instance in question has very high CPU usage (higher than 90% on all 16 cores all the time) during business hours.

We have also noticed very high wait times: The combination of CXPACKET & LATCH_EX accounts for about 97% of all waits. This is split about 50/50 between CXPACKET & LATCH_EX.

The non-buffered latch wait accounting for the vast majority (>95%) of LATCH_EX is ACCESS_METHODS_DATASET_PARENT.

This suggests the problem is to do with parallelism.

An example of the scale of wait times is:

CXPACKET : 332,301,799 ms
LATCH_EX : 267,955,752 ms
PAGEIOLATCH_SH : 2,955,160 ms


This was for the period between 08:00-16:24 on Jan 11th.

Options under consideration

1) Change MAXDOP from 0 to something between 4 and 8

2) Modify the cost threshold of parallelism from 50 to a higher number

Suggestions most welcome on how to ease the very high CPU load we are seeing, and reduce timeouts, in particular whether the proposed course of action is wise, and which numbers to change MAXDOP and cost threshold of parallellism to.

Background information

-
SQL-Server 2008 R2 running on AMD Opteron 6180 SE, of which 16 cores are given to this instance of SQL-Server.

-
Type of workload: something of the order of 800 connections at the same time during business hours; majority OLTP type workload with some OLAP mixed in.

-
Microsoft SQL Server 2008 R2 (SP1) - 10.50.2500.0 (X64) ... Enterprise Edition (64-bit) on Windows NT 6.1 (Build 7601: Service Pack 1). Memory is appx 128 Gigs between 24 cores. 16 of the cores are available to this instance

Solution

majority OLTP type workload with some OLAP mixed in

This is the problem, not the CXPACKET. Parallelism is a symptom, not a cause. Your 'some' OLAP workload is doing scans which trigger parallelism, which cascades into exchange wait times, likely buffer pool pollution and possible blocking (OLTP workload blocks behind OLAP scans).

If the OLAP workload is well understood and absolutely critical, then you can consider adding appropriator covering indices for it. But is an ever uphill battle. I would much rather see the OLAP workload, with its damaging scans, move onto a dedicated box. Newer versions (SQL Server 2014) have readable secondaries and columnstores, both exquisite at serving analytical/ad-hoc/OLAP workloads.

For SQL Server 2008 R2 I would consider log shipping or replication (although I reckon none is 'perfect').

Short term: you have a performance problem and you need to analyse it appropriately. Read How to analyse SQL Server performance. Identify the query or queries that cause most damage (see identifying problem queries. Only then, after you identified the actual problem, can solutions be recommended.

NB: LATCH_EX on ACCESS_METHODS_DATASET_PARENT is not about IO at all. It is strictly related to parallelism, as is the latch the parallel scans 'child' threads must acquire on the parent thread to allocate the scan range for that child. Contention on it indicates the parallelism is inefficient (is doing more 'homework' than actual useful work). Partitioning would exacerbate this symptom, specially unaligned partitioning (as parent/child datasets are set per partition). Bad cardinality estimates (outdated stats?) may also be to blame, conducting to parallelism when not necessary. On and all my advice is the same: identify the actual problem query(s).

Context

StackExchange Database Administrators Q#126095, answer score: 7

Revisions (0)

No revisions yet.