HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Archiving of old data

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
archivingolddata

Problem

We're currently running into some performance problems since our database is getting too big. There are data stored from the last 10 years and I don't see a reason why the data older than 2 years have to be stored in the same tables as the new data.

Now since I don't have very profound experience in administrating databases, I'm looking for the best ways to archive old data.

Info

-
There are about 310'000'000 records in the database in total.

-
The database needs 250 GB on the hard disk.

  • The Server version is SQL Server 2008 with compatibility level SQL Server 2005 (90), but we're planning on upgrading to SQL Server 2012 soon



I've thought about two possibilities:

New Database

Create Database similar to the one on the production server and insert all the old data in the new database.

  • Disadvantage: Since linked servers are not allowed in our environment, it would be difficult to join the old data if needed



History Schema

Create a new schema f.e. [hist] with the same tables as in the production database. Insert all old data in these new tables in the new schema.

  • Advantage: Easy joining, if old data would be needed in the future



  • Do you prefere one of the solutions over the other?



  • Why?



  • Are there any better possibilities?



  • Are there existing tools with which this task is easily possible?



  • Any other thoughts?



Thanks in advance

Edit

Additional Question:

Would the newly created archive table also need primary / foreign keys?

Or should they just have the columns but without keys / constraints?

Solution

I think the answer to many of your questions is that it depends. What performance problems are you having? It seems unusual that a database would have performance problems just from growing to 250GB in size.

Perhaps your queries are performing table scans on the entire fact table even when only a small portion (e.g., the last year) of the date range is needed? If there is a particular query that is most important to optimize, considering posting your schema, query, and an actual execution plan in another question to see if it can be optimized.



Do you prefer one of the solutions over the other?


I generally prefer the history database, and I think Guy describes good reasons for this in his response.

The primary disadvantage I see for a history database (as opposed to a schema) is that you can no longer use foreign keys for your archive table. This may be fine for you, but it's something to be aware of.

The disadvantage you listed for this approach is not accurate; you will be able to query across databases on the same server easily and the query optimizer generally handles cross-database queries very well.



Are there any better possibilities?


If you need to query the archive data regularly, I might consider partitioning the table by date. However, this is a big change that can come with a lot of performance implications, both positive (e.g., partition elimination, more efficient data loading) and negative (e.g., slower singleton seeks, greater potential for thread skew in parallel queries). So I wouldn't make this decision lightly if it's a heavily used database.



Would the newly created archive table also need primary / foreign keys?
Or should they just have the columns but without keys / constraints?


I would recommend having at least the primary key and unique indexes so that you can get the data integrity benefits they provide. For example, this will prevent you from accidentally inserting a year of data into the history table twice. And as a side benefit it may improve performance if you do need to query the history table.



Any other thoughts?


Since you are using Enterprise edition and planning to upgrade to SQL 2008+, you might consider data compression for this table. Compression will certainly reduce disk space, but depending on your server's disk and CPU resources it may also improve query performance for reads by reducing disk I/O and improving memory utilization (more data fits in cache at once).

Context

StackExchange Database Administrators Q#117146, answer score: 14

Revisions (0)

No revisions yet.