patternsqlModerate
Most cost efficient way to page through a poorly ordered table?
Viewed 0 times
tableefficientwaypoorlypagethroughcostmostordered
Problem
I have a table that has three columns:
HashUID1, HashUID2, Address_Name (which is a textual email address, and the previous two hash colunms are of some crazy creation to link event participant tables to email addresses. its ugly, it barely works its out of my control. Focus on the address_name index)
It has 78 million rows. Not properly sorted. Regardless, this index is split onto a lot of fast LUN's and performs REALLY fast index seeks.
I need to create a series of queries to extract only 20,000 "rows per page" at a time, but avoid conflicts or dupes. Since there is no identity column, or easily ordered column, is there an easy way to select all, and page through it?
Am I correct in saying that if I do a select * from hugetablewithemails into a temp table, then select through it by row_number that the table remains in memory for the duration of the transaction, which, to me, is an excessive amount of memory resources?
This seems the preferred method of paging. I'd rather page by statistical percentages. :(
There is one index which maintains the address_name email address in order, and is well maintained. For the past week I have been meaning to help this other developer by spending some time on looking into building a proc that spits out ranges based on windowing functions based on statistics (which I am not great at, but this query really interested me) to provide a range of characters 1 through (variable) LEFT LIKE chars of the index, that meets 20,000 rows--But I have not had time to even start the query...
Couple questions:
-
Any suggestions? Not looking for actual code, just some hints or suggestions based on experiences, maybe caveats. I want to avoid additional index scans after the initial scan.
-
Is this the right approach?
-
I'm thinking of breaking the sum of the index of all email addresses, gathering rowcount(*), /20,000, and usinng that as a windowing function to group min/max substring(1,5) values based on percentages of total rowcount
HashUID1, HashUID2, Address_Name (which is a textual email address, and the previous two hash colunms are of some crazy creation to link event participant tables to email addresses. its ugly, it barely works its out of my control. Focus on the address_name index)
It has 78 million rows. Not properly sorted. Regardless, this index is split onto a lot of fast LUN's and performs REALLY fast index seeks.
I need to create a series of queries to extract only 20,000 "rows per page" at a time, but avoid conflicts or dupes. Since there is no identity column, or easily ordered column, is there an easy way to select all, and page through it?
Am I correct in saying that if I do a select * from hugetablewithemails into a temp table, then select through it by row_number that the table remains in memory for the duration of the transaction, which, to me, is an excessive amount of memory resources?
This seems the preferred method of paging. I'd rather page by statistical percentages. :(
There is one index which maintains the address_name email address in order, and is well maintained. For the past week I have been meaning to help this other developer by spending some time on looking into building a proc that spits out ranges based on windowing functions based on statistics (which I am not great at, but this query really interested me) to provide a range of characters 1 through (variable) LEFT LIKE chars of the index, that meets 20,000 rows--But I have not had time to even start the query...
Couple questions:
-
Any suggestions? Not looking for actual code, just some hints or suggestions based on experiences, maybe caveats. I want to avoid additional index scans after the initial scan.
-
Is this the right approach?
-
I'm thinking of breaking the sum of the index of all email addresses, gathering rowcount(*), /20,000, and usinng that as a windowing function to group min/max substring(1,5) values based on percentages of total rowcount
Solution
Essentially, you are asking if you can perform a single ordered scan through the data overall, while making no copies of the data, and returning 'x' disjoint sets of rows from the full set on each call. This is exactly the behaviour of an appropriately-configured API cursor.
For example, using the AdventureWorks table
Each fetch operation returns a maximum of 1,000 rows, remembering the position of the scan from the previous call.
For example, using the AdventureWorks table
Person.EmailAddress to return sets of 1,000 rows:DECLARE
@cur integer,
-- FAST_FORWARD | AUTO_FETCH | AUTO_CLOSE
@scrollopt integer = 16 | 8192 | 16384,
-- READ_ONLY, CHECK_ACCEPTED_OPTS, READ_ONLY_ACCEPTABLE
@ccopt integer = 1 | 32768 | 65536,
@rowcount integer = 1000,
@rc integer;
-- Open the cursor and return the first 1,000 rows
EXECUTE @rc = sys.sp_cursoropen
@cur OUTPUT,
N'
SELECT *
FROM AdventureWorks2012.Person.EmailAddress
WITH (INDEX([IX_EmailAddress_EmailAddress]))
ORDER BY EmailAddress;
',
@scrollopt OUTPUT,
@ccopt OUTPUT,
@rowcount OUTPUT;
IF @rc <> 16 -- FastForward cursor automatically closed
BEGIN
-- Name the cursor so we can use CURSOR_STATUS
EXECUTE sys.sp_cursoroption
@cur,
2,
'MyCursorName';
-- Until the cursor auto-closes
WHILE CURSOR_STATUS('global', 'MyCursorName') = 1
BEGIN
EXECUTE sys.sp_cursorfetch
@cur,
2,
0,
1000;
END;
END;Each fetch operation returns a maximum of 1,000 rows, remembering the position of the scan from the previous call.
Code Snippets
DECLARE
@cur integer,
-- FAST_FORWARD | AUTO_FETCH | AUTO_CLOSE
@scrollopt integer = 16 | 8192 | 16384,
-- READ_ONLY, CHECK_ACCEPTED_OPTS, READ_ONLY_ACCEPTABLE
@ccopt integer = 1 | 32768 | 65536,
@rowcount integer = 1000,
@rc integer;
-- Open the cursor and return the first 1,000 rows
EXECUTE @rc = sys.sp_cursoropen
@cur OUTPUT,
N'
SELECT *
FROM AdventureWorks2012.Person.EmailAddress
WITH (INDEX([IX_EmailAddress_EmailAddress]))
ORDER BY EmailAddress;
',
@scrollopt OUTPUT,
@ccopt OUTPUT,
@rowcount OUTPUT;
IF @rc <> 16 -- FastForward cursor automatically closed
BEGIN
-- Name the cursor so we can use CURSOR_STATUS
EXECUTE sys.sp_cursoroption
@cur,
2,
'MyCursorName';
-- Until the cursor auto-closes
WHILE CURSOR_STATUS('global', 'MyCursorName') = 1
BEGIN
EXECUTE sys.sp_cursorfetch
@cur,
2,
0,
1000;
END;
END;Context
StackExchange Database Administrators Q#82715, answer score: 17
Revisions (0)
No revisions yet.