patternsqlModerate
What's the best way to archive all but current year and partition the table at the same time
Viewed 0 times
samethebestyearwhatallpartitionbutwaytime
Problem
Task
Archive off all but a rolling 13 month period from a group of large tables. The archived data must be stored in another database.
The "archive" table will have about 1.1 billion rows, the "live" table about 400 million. Obviously the archive table will increase over time, but I expect the live table to increase reasonably quickly too. Say 50% in the next couple of years at least.
I had thought about Azure stretch databases but unfortunately we are at 2008 R2 and likely to stay there for a while.
Current Plan
Problem: I'm trying to move the data into the initial partitioned tables (in fact I'm still doing a proof of concept on it). I'm trying to use TF 610 (as per the Data Loading Performance Guide) and an
At this point I'm thinking my best bet may be to move the data using an SSIS package. I'm trying
Archive off all but a rolling 13 month period from a group of large tables. The archived data must be stored in another database.
- The database is in simple recovery mode
- The tables are 50 mil rows to several billion and in some cases take up hundreds of gb each.
- The tables are currently not partitioned
- Each table has one clustered index on an ever increasing date column
- Each table additionally has one non-clustered index
- All data changes to the tables are inserts
- The goal is to minimize downtime of the primary database.
- Server is 2008 R2 Enterprise
The "archive" table will have about 1.1 billion rows, the "live" table about 400 million. Obviously the archive table will increase over time, but I expect the live table to increase reasonably quickly too. Say 50% in the next couple of years at least.
I had thought about Azure stretch databases but unfortunately we are at 2008 R2 and likely to stay there for a while.
Current Plan
- Create a new database
- Create new tables partitioned by month (using the modified date) in the new database.
- Move the most recent 12-13 months of data into the partitioned tables.
- Do a rename swap of the two databases
- Delete the moved data from the now "archive" database.
- Partition each of the tables in the "archive" database.
- Use partition swaps to archive the data in the future.
- I do realize that I'll have to swap out the data to be archived, copy that table to the archive database, and then swap it into the archive table. This is acceptable.
Problem: I'm trying to move the data into the initial partitioned tables (in fact I'm still doing a proof of concept on it). I'm trying to use TF 610 (as per the Data Loading Performance Guide) and an
INSERT...SELECT statement to move the data initially thinking it would be minimally logged. Unfortunately every time I try it's fully logged.At this point I'm thinking my best bet may be to move the data using an SSIS package. I'm trying
Solution
Why are you not getting minimal logging?
I have found the Data Loading Performance Guide, which you reference, to be an extremely valuable resource. However, it is also not 100% comprehensive, and I suspect that the grid is complex enough already that the author did not add a column
Recommended approach
Based upon the recommendations in the Data Loading Performance Guide (including the "Bulk Loading a Partitioned Table" section) as well as extensive experience loading partitioned tables with tens of billions of rows, here is the approach I would recommend:
The differences when compared to your original approach:
Comparison of approaches for moving the recent year of data
In order to compare approaches in a reasonable amount of time on my machine, I used a
As you can see from the results below, there is a large performance boost and reduction in log writes by loading data into a heap using the
Here is the full test script.
Final notes
All of these results depend on your hardware to some degree. However, my tests were conducted on a standard quad-core laptop with spinning disk drive. It's likely that the data loads should be much faster if you are using a decent server that does not have a lot of other load at the time you are conducting this process.
For example, I ran the recommended approach on an actual dev server (Dell R720) and saw a reduction to
I have found the Data Loading Performance Guide, which you reference, to be an extremely valuable resource. However, it is also not 100% comprehensive, and I suspect that the grid is complex enough already that the author did not add a column
Table Partitioning to break out differences in behavior depending on whether the table receiving the inserts is partitioned. As we will see later, the fact that table is already partitioned appears to inhibit minimal logging.Recommended approach
Based upon the recommendations in the Data Loading Performance Guide (including the "Bulk Loading a Partitioned Table" section) as well as extensive experience loading partitioned tables with tens of billions of rows, here is the approach I would recommend:
- Create a new database.
- Create new tables partitioned by month in the new database.
- Move the most recent year of data, in the following fashion:
- For each month, create a new heap table;
- Insert that month of data into the heap using the TABLOCK hint;
- Add the clustered index to the heap containing that month of data;
- Add the check constraint enforcing that the table contains just this month's data;
- Switch the table into the corresponding partition of the new overall partitioned table.
- Do a rename swap of the two databases.
- Truncate the data in the now "archive" database.
- Partition each of the tables in the "archive" database.
- Use partition swaps to archive the data in the future.
The differences when compared to your original approach:
- The methodology of moving the recent 12-13 months of data will be much more efficient if you load into a heap with
TABLOCKone month at a time, using partition switching to place the data into the partitioned table.
- A
DELETEto clear away old table will be fully logged. Perhaps you can eitherTRUNCATEor drop the table and create a new archive table.
Comparison of approaches for moving the recent year of data
In order to compare approaches in a reasonable amount of time on my machine, I used a
100MM row test data set that I generated and that follows your schema. As you can see from the results below, there is a large performance boost and reduction in log writes by loading data into a heap using the
TABLOCK hint. There is an additional benefit if this is done one partition at a time. It's also worth noting that the one-partition-at-a-time method can easily be parallelized further if you run multiple partitions at once. Depending on your hardware, that might yield a nice boost; we typically load at least four partitions at once on server-class hardware.Here is the full test script.
Final notes
All of these results depend on your hardware to some degree. However, my tests were conducted on a standard quad-core laptop with spinning disk drive. It's likely that the data loads should be much faster if you are using a decent server that does not have a lot of other load at the time you are conducting this process.
For example, I ran the recommended approach on an actual dev server (Dell R720) and saw a reduction to
76 seconds (from 156 seconds on my laptop). Interestingly, the original approach of inserting into a partitioned table did not experience the same improvement and still took just over 12 minutes on the dev server. Presumably this is because this pattern yields a serial execution plan, and a single processor on my laptop can match a single processor on the dev server.Context
StackExchange Database Administrators Q#140772, answer score: 11
Revisions (0)
No revisions yet.