HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Finding missing gaps of data in a table with ~2.5 Million rows

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
millionrowswithfindingmissingdatatablegaps

Problem

I work for an economics consulting company and all of our data is housed in SQL Server 11 (2012). Our data essentially consists of a date column, a raw data column, several calculations columns and a column that has a shortcode to distinguish each set of data from the next.

We have thousands of these sets that are all loaded into the same table (~2.5 Mil rows total) and then drawn from and ordered with a query. They are generally ordered by date, starting with the first month and moving month by month until the last month. The start and end dates of each series varies from just a year's worth of time to 100+ years.

Recently, we've been having some issues where random sections of data just disappear. The entire row just up and leaves, which makes finding these missing rows somewhat of a challenge without combing through every single month to check if a month is missing, a somewhat impossible task for a 2.5 million row table.

My boss has tasked me with writing a query/stored procedure that will look through this giant table and look to see which sets having missing rows and where they are.

I've been trying to work through this problem that is a little bit above where my SQL skills are and I can't seem to find anybody that has had a similar issue anywhere on the web. I'm going to go through what I already have and maybe someone can at least tell me if I'm headed in the right direction and possibly provide some insight as to where I should go from here.

The best solution I could find on the web was using a CTE to create a temp table of dates and then compare them with the original table. This works great if I was just scanning for problems in one particular data set, but I have many data sets within the same table, all with different starting and end dates. So I went with it anyways in hopes that I could eventually expand it to search through the lot of them. Here's my code:

```
declare @startDate Date, @endDate Date
set @startDate = '2000-01-01'
set @endDate =

Solution

To start, while with only 202 months to check it won't be a huge issue, a recursive CTE is generally the worst possible way to derive a set, in terms of performance (I prove this here and here).

If you're going to be running this query more than once (and it sounds like you will be, until you solve the separate issue of who/what is deleting this data and creating the gaps in the first place), why not just build a months table that will always be there?

CREATE TABLE dbo.Months([Month] date PRIMARY KEY);

DECLARE @StartDate     date = '20000101', 
        @NumberOfYears int  = 30;

INSERT dbo.Months([Month])
  SELECT TOP (12*@NumberOfYears) 
  DATEADD(MONTH, ROW_NUMBER() OVER (ORDER BY number) -1, @StartDate) 
FROM master.dbo.spt_values;


30 years of months, which will work through the year 2029, stored in a whopping 72kb. When I first wrote this I sarcastically emphasized whopping, but I should explain why this has 9 pages instead of the expected 2. In current versions of SQL Server (I initially tested this on SQL Server 2016, but the same is true in v.Next), the storage engine reserves an entire, uniform extent for new objects. This is 8 x 8kb pages, plus the IAM page for 72kb - in this case only one of the data pages is actually required, so 7 remain unallocated. This means they won't show up in all catalog views, but they're still easy to find (click to enlarge):

You can turn this behavior off for user databases, but personally I wouldn't (they made it the default for a reason). Your first instinct might be about saving memory rather than disk space, but while this puts 72kb on disk, only 16kb will ever be loaded into the buffer pool. So no need to panic about that.

Now your query can be:

DECLARE @startDate date = '20000101', @endDate date = '20161101';

;WITH shortcodes AS
(
  SELECT DISTINCT ShortCode 
  FROM dbo.VWTBL_INDICATOR
  WHERE MonthYear >= @startDate AND MonthYear = @startDate AND m.[Month] <= @endDate
AND vwtbl.MonthYear IS NULL;


Note that currently this will identify all months in your defined range where a ShortCode doesn't appear, even if it's outside the range that is valid for that ShortCode. If those valid ranges per ShortCode are defined somewhere, please add that information to the question.

What on earth is a "VWTBL"?

Code Snippets

CREATE TABLE dbo.Months([Month] date PRIMARY KEY);

DECLARE @StartDate     date = '20000101', 
        @NumberOfYears int  = 30;

INSERT dbo.Months([Month])
  SELECT TOP (12*@NumberOfYears) 
  DATEADD(MONTH, ROW_NUMBER() OVER (ORDER BY number) -1, @StartDate) 
FROM master.dbo.spt_values;
DECLARE @startDate date = '20000101', @endDate date = '20161101';

;WITH shortcodes AS
(
  SELECT DISTINCT ShortCode 
  FROM dbo.VWTBL_INDICATOR
  WHERE MonthYear >= @startDate AND MonthYear <= @endDate
)
SELECT m.[Month], s.ShortCode 
FROM dbo.Months AS m
CROSS JOIN shortcodes AS s
LEFT OUTER JOIN dbo.VWTBL_INDICATOR AS vwtbl
ON s.ShortCode = vwtbl.ShortCode
AND m.[Month] = vwtbl.MonthYear
WHERE m.[Month] >= @startDate AND m.[Month] <= @endDate
AND vwtbl.MonthYear IS NULL;

Context

StackExchange Database Administrators Q#159853, answer score: 9

Revisions (0)

No revisions yet.