debugsqlMinor
Why and how sometimes I get the "absent table error" while there are definitely Sch-M lock?
Viewed 0 times
sometimesabsentwhytheerrorwhilearegethowand
Problem
I have a synthetic test, which reproduces some error we have on the production environment. Here are the 2 scripts to reproduce it:
1st
2nd
So, the problem is that when I run the 1st script in one session and leave it running, and then run the 2nd in separate session I can get this type of error:
Msg 208, Level 16, State 1, Line 13 Invalid object name 'Test'.
The question is - WHY i see this error in case when there is a
I have a feeling, that it is somehow connected to the situation, when the script commits, there is still the table with the name
Can somebody answer or guide me to the tech papers I can read deeply and understand the reason?
1st
DBCC TRACEOFF(-1,3604,1200) WITH NO_INFOMSGS;
SET NOCOUNT ON;
IF @@TRANCOUNT > 0 ROLLBACK
IF object_id('test') IS NOT NULL
DROP TABLE test
IF object_id('TMP_test') IS NOT NULL
DROP TABLE TMP_test
IF object_id('test1') IS NOT NULL
DROP TABLE test1
CREATE TABLE test(Id INT IDENTITY PRIMARY KEY)
GO
INSERT test DEFAULT VALUES
GO 2000
WHILE 1 = 1
BEGIN
CREATE TABLE TMP_test(Id INT PRIMARY KEY)
INSERT TMP_test SELECT * FROM test
WAITFOR DELAY '0:00:00.1'
BEGIN TRAN
EXEC sp_rename 'test', 'test1'
EXEC sp_rename 'TMP_test', 'test'
EXEC sp_rename 'test1', 'TMP_test'
DROP TABLE TMP_test
commit
END2nd
SET NOCOUNT ON;
DECLARE @c INT
WHILE 1 = 1
BEGIN
SELECT @c = COUNT(*) FROM Test IF @@ERROR <> 0 BREAK
SELECT @c = COUNT(*) FROM Test IF @@ERROR <> 0 BREAK
/* and repeat this 10-20 times more*/
SELECT @c = COUNT(*) FROM Test IF @@ERROR <> 0 BREAK
ENDSo, the problem is that when I run the 1st script in one session and leave it running, and then run the 2nd in separate session I can get this type of error:
Msg 208, Level 16, State 1, Line 13 Invalid object name 'Test'.
The question is - WHY i see this error in case when there is a
COMMIT in the cycle end of the 1st script and never get one if there is a ROLLBACK?I have a feeling, that it is somehow connected to the situation, when the script commits, there is still the table with the name
test but it is a different object and the 2nd script have to recompile itself. And this is OK. But why it gets the missed table error? AFAIK - when I rename the table inside a transaction - it holds Sch-M lock to the tran end?Can somebody answer or guide me to the tech papers I can read deeply and understand the reason?
Solution
Really interesting and tough question. I was not able to find any official documentation of this behavior, and I suspect there may not be any (though I'd love it if someone corrects me on that!).
My research leads me to believe that it is the plan compilation step that is vulnerable to this race condition. Note that I was able to run your test queries for an hour with no errors, but if I repeatedly kick kick off your process, I get an immediate error a portion of the time. When it does hit an error, it always does so right away on plan compilation. Alternatively, you can add "OPTION RECOMPILE" to the COUNT(*) in the loop, forcing a new plan to compile at each trial. Using this approach, I see the error almost immediately every time I have run your scripts.
I was able to reproduce the error with a controlled series of steps that seems to hit the error on every trial, removing the need to set up a loop and rely on randomness.
I also threw in a potential fix (to use ALTER TABLE...SWITCH) which might be possible to try out in your production environment. Now, on to the details!
Here are the steps to reproduce:
Using these steps, we can dig in a little bit more to what is causing the error. To do so, I ran a trace with the following events on: SP:StmtStarting, SP:StmtCompleted, Lock:Acquired, Lock:Released.
What I found is that the following occurs, in order (but omitting some detail in between):
However, if a similar sequence of events occurs when a plan compilation is not needed (such as when running your loop), SQL Server is actually smart enough to detect that the object_id has changed. I also traced that case, and I found a situation where the second COUNT(*) performed the following sequence
So it does look like this adaptive logic is in place, as you hypothesized. But it looks like it may be in place only for query execution, and not for query compilation.
As promised, here is an alternative snippet for section (3) that appears to provide a way of renaming
My research leads me to believe that it is the plan compilation step that is vulnerable to this race condition. Note that I was able to run your test queries for an hour with no errors, but if I repeatedly kick kick off your process, I get an immediate error a portion of the time. When it does hit an error, it always does so right away on plan compilation. Alternatively, you can add "OPTION RECOMPILE" to the COUNT(*) in the loop, forcing a new plan to compile at each trial. Using this approach, I see the error almost immediately every time I have run your scripts.
I was able to reproduce the error with a controlled series of steps that seems to hit the error on every trial, removing the need to set up a loop and rely on randomness.
I also threw in a potential fix (to use ALTER TABLE...SWITCH) which might be possible to try out in your production environment. Now, on to the details!
Here are the steps to reproduce:
-- Instructions: Process each step one at a time, following any instructions in that step
-- (1) Initial setup: Clean any relics and create the test table
DBCC TRACEOFF(-1,3604,1200) WITH NO_INFOMSGS;
SET NOCOUNT ON;
IF @@TRANCOUNT > 0 ROLLBACK
IF object_id('test') IS NOT NULL
DROP TABLE test
IF object_id('TMP_test') IS NOT NULL
DROP TABLE TMP_test
IF object_id('test1') IS NOT NULL
DROP TABLE test1
CREATE TABLE test(Id INT IDENTITY CONSTRAINT PK_test PRIMARY KEY)
GO
INSERT test DEFAULT VALUES
GO 2000
-- (2) Run the COUNT(*)
-- This will acquire the Sch-S lock, and we use HOLDLOCK to retain that lock.
-- This simulates a race condition where this query is running at the time the first rename occurs.
BEGIN TRANSACTION
SELECT COUNT(*) FROM Test WITH (HOLDLOCK)
GO
-- (3) In another window, run the block of code that performs the renames
-- This will be blocked, waiting for a Sch-M lock on "test" until we complete step 4 below
CREATE TABLE TMP_test(Id INT PRIMARY KEY)
INSERT TMP_test SELECT * FROM test
EXEC sp_rename 'test', 'test1'
EXEC sp_rename 'TMP_test', 'test'
EXEC sp_rename 'test1', 'TMP_test'
DROP TABLE TMP_test
GO
-- (4) Now, commit the original COUNT(*) and immediately fire off the query again
-- We use OPTION (RECOMPILE) to make sure we need to compile a new query plan
-- The COMMIT releases the Sch-S lock, allowing (3) to acquire the Sch-M lock
-- This batch will now be waiting for the Sch-S lock again, on the same object_id,
-- but that object_id will no longer point to the correct object by the time the lock
-- is acquired.
COMMIT
SELECT COUNT(*) FROM Test OPTION (RECOMPILE)
GOUsing these steps, we can dig in a little bit more to what is causing the error. To do so, I ran a trace with the following events on: SP:StmtStarting, SP:StmtCompleted, Lock:Acquired, Lock:Released.
What I found is that the following occurs, in order (but omitting some detail in between):
- The first COUNT(*) acquires a Sch-S lock
- The first COUNT(*) acquires some locks on statistics (presumably to build a query plan)
- The first COUNT(*) releases a Sch-S lock (end of plan compilation)
- The first COUNT(*) acquires a IS lock (beginning of execution), then runs successfully
- The sp_rename runs up to the point where it needs to take a Sch-M lock, then is blocked by the first COUNT(*), which is not yet committed
- The first COUNT(*) is committed
- The sp_renames acquires the Sch-M lock
- The second COUNT(*) requests a Sch-S lock and is added to the blocking chain behind the sp_rename
- The sp_renames complete (at this point, the second COUNT(*) is waiting on a Sch-S lock for an object_id that is no longer the correct object_id)
- The second COUNT(*) acquires the Sch-S lock
- The second COUNT(*) acquires a Sch-S lock on sys.sysschobjs, presumably as a sanity check because it has detected (or needs to confirm) that something is amiss with Sch-S lock it was granted on the object_id that used to be "test"
- The Invalid object name 'Test' error is thrown
However, if a similar sequence of events occurs when a plan compilation is not needed (such as when running your loop), SQL Server is actually smart enough to detect that the object_id has changed. I also traced that case, and I found a situation where the second COUNT(*) performed the following sequence
- Acquires the Sch-S lock (on the object_id that used to be "test")
- Acquires a Sch-S lock on sys.sysschobjs
- Releases both of these locks
- Acquires a IS lock on a different object_id; the new object_id of "test"!
- Runs successfully to completion
So it does look like this adaptive logic is in place, as you hypothesized. But it looks like it may be in place only for query execution, and not for query compilation.
As promised, here is an alternative snippet for section (3) that appears to provide a way of renaming
Code Snippets
-- Instructions: Process each step one at a time, following any instructions in that step
-- (1) Initial setup: Clean any relics and create the test table
DBCC TRACEOFF(-1,3604,1200) WITH NO_INFOMSGS;
SET NOCOUNT ON;
IF @@TRANCOUNT > 0 ROLLBACK
IF object_id('test') IS NOT NULL
DROP TABLE test
IF object_id('TMP_test') IS NOT NULL
DROP TABLE TMP_test
IF object_id('test1') IS NOT NULL
DROP TABLE test1
CREATE TABLE test(Id INT IDENTITY CONSTRAINT PK_test PRIMARY KEY)
GO
INSERT test DEFAULT VALUES
GO 2000
-- (2) Run the COUNT(*)
-- This will acquire the Sch-S lock, and we use HOLDLOCK to retain that lock.
-- This simulates a race condition where this query is running at the time the first rename occurs.
BEGIN TRANSACTION
SELECT COUNT(*) FROM Test WITH (HOLDLOCK)
GO
-- (3) In another window, run the block of code that performs the renames
-- This will be blocked, waiting for a Sch-M lock on "test" until we complete step 4 below
CREATE TABLE TMP_test(Id INT PRIMARY KEY)
INSERT TMP_test SELECT * FROM test
EXEC sp_rename 'test', 'test1'
EXEC sp_rename 'TMP_test', 'test'
EXEC sp_rename 'test1', 'TMP_test'
DROP TABLE TMP_test
GO
-- (4) Now, commit the original COUNT(*) and immediately fire off the query again
-- We use OPTION (RECOMPILE) to make sure we need to compile a new query plan
-- The COMMIT releases the Sch-S lock, allowing (3) to acquire the Sch-M lock
-- This batch will now be waiting for the Sch-S lock again, on the same object_id,
-- but that object_id will no longer point to the correct object by the time the lock
-- is acquired.
COMMIT
SELECT COUNT(*) FROM Test OPTION (RECOMPILE)
GO-- (3) In another window, run the block of code that performs the "renames"
-- Instead of sp_rename, we use ALTER TABLE...SWITCH
-- This appears to be a more robust way of performing the same logic
CREATE TABLE test1(Id INT PRIMARY KEY)
CREATE TABLE TMP_test(Id INT PRIMARY KEY)
INSERT TMP_test SELECT * FROM test
ALTER TABLE test SWITCH TO test1
ALTER TABLE TMP_test SWITCH TO test
ALTER TABLE test1 SWITCH TO TMP_test
DROP TABLE TMP_test
DROP TABLE test1
GOContext
StackExchange Database Administrators Q#89777, answer score: 6
Revisions (0)
No revisions yet.