Random Error 824 - All Environments
Problem
We have been experiencing seemingly random 824 errors on our DEV/QA/PROD SQL Server 2017 Enterprise servers. The servers run nearly identical code, ingesting identical daily files via an ETL process into our data warehouse. The errors were first noticed around May 2022, but due to log cleanups we can't be sure, as the (vendor-supplied) ETL process catches these errors, logs a warning, and continues processing instead of failing!
DEV/QA have been patched to CU30 (latest CU) -- the condition still exists.
Production is a few patches behind at CU22, and is scheduled to be patched in the coming weeks.
Example:
SQL Server detected a logical inconsistency-based I/O error: incorrect
checksum (expected 0xc30164e7; actual 0x9f2bc675c). It occurred
during a read of page (7:1306400) in database ID 2 at offset
0x0000027de40000 in file 'H:\tempdb_mssql_6.ndf'.
As stated, this is happening randomly on all of our environments. All servers are virtualized.
DEV/QA are both utilizing the same SAN. Production is on a separate SAN in a different data centre. I do not have details on make/model of the SAN devices.
For the most part, when this happens it seems to be in tempdb (but not always). Also, suspect_pages is more often than not empty. It also appears to happen more often on Saturdays; we had it happen 3-4 Saturdays in a row.
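For reference, suspect_pages can be queried directly to cross-check what has actually been recorded; a minimal sketch (retention caveats apply):

```sql
-- Check msdb for recorded suspect pages. Note: msdb.dbo.suspect_pages
-- is capped at 1,000 rows and is not cleaned up automatically, so old
-- entries may linger (or, as here, nothing may be recorded at all).
SELECT database_id,
       file_id,
       page_id,
       event_type,       -- 1 = 823/824 error, 2 = bad checksum, 3 = torn page
       error_count,
       last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
```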
Also noted: the expected/actual checksum values listed in the error are often the same -- but not always.
It has also been noted that one particular stored procedure seems more susceptible to triggering this error, but it has happened at multiple other places in the ETL job, again impacting different databases. The stored procedure that most often triggers it adds a PERSISTED computed column, then a ROW_NUMBER() based on that computed column, to 5 tables ranging from 200K to 7.5M rows in size. We modified this procedure yesterday (in QA) to limit the number of rows updated with the ROW_NUMBER() value (only where rownu
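As a rough illustration only (the table and column names below are hypothetical placeholders, not the vendor's actual code), the pattern described looks something like:

```sql
-- Hypothetical sketch of the suspect pattern: add a PERSISTED computed
-- column, then populate a row-number column ordered by it.
ALTER TABLE dbo.StagingTable
    ADD HashKey AS CHECKSUM(BusinessKey, LoadDate) PERSISTED;
GO

WITH numbered AS
(
    SELECT RowNum,
           ROW_NUMBER() OVER (ORDER BY HashKey) AS rn
    FROM dbo.StagingTable
)
UPDATE numbered
SET RowNum = rn;   -- the QA change limits which rows get updated here
```

A large UPDATE ordered by a freshly persisted column forces sizeable sorts that spill into tempdb, which would be consistent with most of the errors surfacing on tempdb pages.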
Solution
82x errors are rarely, if ever, a SQL Server issue: they are raised when data returned to SQL Server doesn't make sense or the underlying I/O has failed. For example, 824 can be caused by a bad driver or a caching layer below SQL Server (including hardware and software such as the hypervisor, SAN, disk controller, etc.).
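Since the ETL tool swallows these errors, the SQL Server error log is a better gauge of how often they actually occur (log cleanups permitting); for example:

```sql
-- Search the current error log (first argument 0) for 824 entries;
-- increment the first argument (1, 2, ...) to search archived logs too.
-- Second argument 1 = SQL Server error log (as opposed to Agent log).
EXEC master.sys.xp_readerrorlog 0, 1, N'824';
```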
There are a few things to try to narrow down the issue:
- Run SQLIOSIM to see if any errors occur. SQLIOSIM exercises the I/O subsystem outside the SQL Server engine, so if it reports errors, the problem is not SQL Server.
- Check for known issues with your hypervisor -- such as VMware drivers known to cause 823/824 errors -- and make the appropriate configuration changes.
- Capture and review a storport trace to see whether other issues in the storage stack during the same time period might be contributing to the problem, and inspect the data returned.
Context
StackExchange Database Administrators Q#315705, answer score: 7
Revisions (0)
No revisions yet.