patternsqlMinor
Why is our query suddenly returning rows it should not (using READPAST and UPDLOCK options)?
Viewed 0 times
rowswhyreadpastqueryupdlockoptionssuddenlyreturningusingshould
Problem
We have a job table that looks like this
with a covering index like this
and we use this query to pick the next job
(The real table has about 10 columns. They are all in the index include() clause and the select column list.)
The execution plan is very simple. It does an index seek using IX_Status, then a top operator. Since the index is sorted on (status, skey) the plan does not need a sort.
The table is in a database in an AlwaysOn Availability Group. The group has 2 DB servers. (It is a test system.)
Normally this table and query work great.
So we go to apply Windows updates, and do the usual.
After the second fail over and all the worker processes get new connections to the new primary, the query starts failing in the sense that multiple processes start getting the same jobs.
The problem is load related. With 4 worker processes running it did not happen. But with 10 workers it happens consistently.
This is using SQL Server 2016 Enterprise. We do not have the query store enabled to see if the execution plan was weird at some point.
Any suggestions on why the query would start failing after two fail overs?
Since the query is only using the index and not touching the table, is UPDLOCK reliable?
Update 1 - we changed the process to list the locks held by the spid (using sp_lock @@spid) just after do
CREATE TABLE [dbo].[Clearing](
[Skey] [decimal](19, 0) IDENTITY(1,1) NOT NULL,
[BsAcctId] [int] NULL,
[Status] [varchar](20) NULL,
CONSTRAINT [csPk_Clearing] PRIMARY KEY CLUSTERED ( [Skey] ASC )
)with a covering index like this
CREATE NONCLUSTERED INDEX [IX_Status] ON [dbo].[Clearing]
(
[Status] ASC
)
INCLUDE ( [Skey], [BsAcctId])and we use this query to pick the next job
select top (1) Skey, BsAcctId, Status from Clearing with ( readpast, updlock )
where (Clearing.Status = 'NEW')
order by Clearing.Skey(The real table has about 10 columns. They are all in the index include() clause and the select column list.)
The execution plan is very simple. It does an index seek using IX_Status, then a top operator. Since the index is sorted on (status, skey) the plan does not need a sort.
The table is in a database in an AlwaysOn Availability Group. The group has 2 DB servers. (It is a test system.)
Normally this table and query work great.
So we go to apply Windows updates, and do the usual.
- Fail over the primary to the secondary
- Apply Windows updates on the former primary
- Fail over back to the original primary
- Apply Windows updates on the secondary
After the second fail over and all the worker processes get new connections to the new primary, the query starts failing in the sense that multiple processes start getting the same jobs.
The problem is load related. With 4 worker processes running it did not happen. But with 10 workers it happens consistently.
This is using SQL Server 2016 Enterprise. We do not have the query store enabled to see if the execution plan was weird at some point.
Any suggestions on why the query would start failing after two fail overs?
Since the query is only using the index and not touching the table, is UPDLOCK reliable?
Update 1 - we changed the process to list the locks held by the spid (using sp_lock @@spid) just after do
Solution
Getting the same row from different indexes
As David mentions in his answer, you can get the same row from multiple sessions if you happen to access that row via different indexes.
The
Running these two queries (with index hints) from different sessions results in the same row being returned:
After the failover, new execution plans will be compiled for incoming queries - so this could explain why you ended up with the new behavior after the failover. As David also said, you could force the index to avoid this problem.
As a side note, you should also use a
Getting different rows in the same session due to concurrency
You also mentioned this:
The problem is load related. With 4 worker processes running it did not happen. But with 10 workers it happens consistently.
So it sounds like the failover was not the only thing that changed - you also increased concurrency on the application side of things.
I tried loading up your table / index with some data:
Then I loaded up SQL Query Stress with your query, set it to run on 10 threads at once, every 100 ms:
While that was running, I periodically ran the same query with
Which you can see using the
Without much concurrency, you'll generally get the same row if you run that
Unexpected locking behavior
It's can be unsafe to depend solely on specific locks to be taken. Consider an optimization described in this blog post from Paul White (or another one like it): The Case of the Missing Shared Locks
The post outlines a situation where a row protected by an X lock can still be read by
SQL Server contains an optimization that allows it to avoid taking row-level shared (S) locks in the right circumstances. Specifically, it can skip shared locks if there is no risk of reading uncommitted data without them.
Related reading:
As David mentions in his answer, you can get the same row from multiple sessions if you happen to access that row via different indexes.
The
UPDLOCK hint only applies to the specific access method. Having a nonclustered index row U locked does not prevent another query acquiring a U lock on a different index (including the clustered index, if any).Running these two queries (with index hints) from different sessions results in the same row being returned:
-- Session 1
BEGIN TRANSACTION;
SELECT TOP (1) Skey, BsAcctId, Status
FROM dbo.Clearing WITH(READPAST, UPDLOCK, INDEX(2))
WHERE ([Status] = 'NEW')
ORDER BY Skey;
-- Session 2
BEGIN TRANSACTION;
SELECT TOP (1) Skey, BsAcctId, Status
FROM dbo.Clearing WITH(READPAST, UPDLOCK, INDEX(1))
WHERE ([Status] = 'NEW')
ORDER BY Skey;After the failover, new execution plans will be compiled for incoming queries - so this could explain why you ended up with the new behavior after the failover. As David also said, you could force the index to avoid this problem.
As a side note, you should also use a
ROWLOCK hint, since READPAST can only skip locks taken at the row granularity.Getting different rows in the same session due to concurrency
You also mentioned this:
The problem is load related. With 4 worker processes running it did not happen. But with 10 workers it happens consistently.
So it sounds like the failover was not the only thing that changed - you also increased concurrency on the application side of things.
I tried loading up your table / index with some data:
INSERT INTO dbo.Clearing
([Status])
SELECT TOP 100
'NEW'
FROM master.dbo.spt_values;
INSERT INTO dbo.Clearing
([Status])
SELECT TOP 10000
'COMPLETE'
FROM master.dbo.spt_values v1
CROSS JOIN master.dbo.spt_values v2;Then I loaded up SQL Query Stress with your query, set it to run on 10 threads at once, every 100 ms:
While that was running, I periodically ran the same query with
EXEC sp_lock @spid1 = @my_spid; tacked on to the end in SSMS. If I run the SELECT query in the same session multiple times (without rolling back), I can get multiple locks held by that session:Which you can see using the
%%lockres%% predicate:SELECT * FROM dbo.Clearing WITH (NOLOCK, INDEX(2)) WHERE %%lockres%% = '(36aaaeef6267)';
SELECT * FROM dbo.Clearing WITH (NOLOCK, INDEX(2)) WHERE %%lockres%% = '(84d6be32a10d)';Without much concurrency, you'll generally get the same row if you run that
SELECT more than once in the session. But with other queries taking and releasing locks all the time, it's easily possible to get different rows. So make sure you are not depending on the SELECT returning the same ID twice (we don't have the whole context of your workload, so this is just speculation / FYI).Unexpected locking behavior
It's can be unsafe to depend solely on specific locks to be taken. Consider an optimization described in this blog post from Paul White (or another one like it): The Case of the Missing Shared Locks
The post outlines a situation where a row protected by an X lock can still be read by
SELECT queries:SQL Server contains an optimization that allows it to avoid taking row-level shared (S) locks in the right circumstances. Specifically, it can skip shared locks if there is no risk of reading uncommitted data without them.
Related reading:
- Using tables as Queues by Remus Rusanu
- When READPAST Doesn’t Read Past by Erik Darling
Code Snippets
-- Session 1
BEGIN TRANSACTION;
SELECT TOP (1) Skey, BsAcctId, Status
FROM dbo.Clearing WITH(READPAST, UPDLOCK, INDEX(2))
WHERE ([Status] = 'NEW')
ORDER BY Skey;
-- Session 2
BEGIN TRANSACTION;
SELECT TOP (1) Skey, BsAcctId, Status
FROM dbo.Clearing WITH(READPAST, UPDLOCK, INDEX(1))
WHERE ([Status] = 'NEW')
ORDER BY Skey;INSERT INTO dbo.Clearing
([Status])
SELECT TOP 100
'NEW'
FROM master.dbo.spt_values;
INSERT INTO dbo.Clearing
([Status])
SELECT TOP 10000
'COMPLETE'
FROM master.dbo.spt_values v1
CROSS JOIN master.dbo.spt_values v2;SELECT * FROM dbo.Clearing WITH (NOLOCK, INDEX(2)) WHERE %%lockres%% = '(36aaaeef6267)';
SELECT * FROM dbo.Clearing WITH (NOLOCK, INDEX(2)) WHERE %%lockres%% = '(84d6be32a10d)';Context
StackExchange Database Administrators Q#241140, answer score: 5
Revisions (0)
No revisions yet.