Database design for big table
Problem
I'm handling a big table which is at 200GB+ and has around 1 billion rows.
I'm looking into splitting the table into smaller tables and then joining the results whenever I query.
We do mainly reading on that table with occasional writing.
Is it a good idea? Or is it something I needn't worry about?
My database server has 4 CPUs and 32GB RAM.
I have only the primary key as an index, to save space. Currently there are no issues, but I'm considering whether to split the table early or just keep it as it is. What are the pros and cons of splitting it into smaller tables of, let's say, 250 million rows each instead of one plain 1-billion-row table? Do you think my 32GB of RAM will be able to handle it?
Solution
There are a couple of patterns for splitting a single logical entity across several physical tables.
Vertical partitioning puts some of the columns in one table and some in another. There can be several such additional tables, if required. All these tables share the same primary key. Columns that are used together are stored together, so one page fetch reads all the required values. The advantage is that there are more rows per page, so scans and aggregates require fewer IOs. Sometimes it can be tricky deciding which columns should go together, and every INSERT becomes proportionately more expensive as the number of partitions increases.
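As a minimal sketch of the idea, here a hypothetical "user" entity is split vertically: hot, frequently-read columns go in one table and a bulky, rarely-read column in another, both sharing the same primary key. The schema and column names are illustrative, not from the question; SQLite is used only so the sketch is self-contained.

```python
import sqlite3

# Two tables share the same primary key: user_core holds the hot columns,
# user_extra holds the bulky, rarely-read one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_core  (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE user_extra (id INTEGER PRIMARY KEY, bio TEXT,
                             FOREIGN KEY (id) REFERENCES user_core(id));
""")

# Every INSERT now touches both tables -- the write cost mentioned above.
conn.execute("INSERT INTO user_core  VALUES (1, 'Alice', 'alice@example.com')")
conn.execute("INSERT INTO user_extra VALUES (1, 'A very long biography...')")

# Queries needing only the hot columns never read the bulky ones; the full
# row is reassembled by joining on the shared key when required.
row = conn.execute("""
    SELECT c.name, e.bio
    FROM user_core c JOIN user_extra e ON c.id = e.id
""").fetchone()
```

A scan of `user_core` alone now packs many more rows per page than the original wide table would.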
Horizontal partitioning splits the data by key range. For example, all users with a surname beginning A through M go into table User_1 and N through Z go into User_2. The application figures out which sub-table to use at run-time, usually algorithmically or through a look-up, though most DBMSes these days offer this as a built-in feature, implemented through DDL and transparent to the application. If you have hot spots in data writes, this can spread the pain out. The optimiser can eliminate whole partitions from range scans, improving response time. Loading and removing whole partitions can be very fast metadata operations.
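The application-side routing described above can be sketched as follows, reusing the surname split from the example (A-M into User_1, N-Z into User_2). The schema is hypothetical, and SQLite stands in for the real DBMS; a native partitioned table would do this routing transparently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("User_1", "User_2"):
    conn.execute(f"CREATE TABLE {t} (surname TEXT, email TEXT)")

def table_for(surname: str) -> str:
    # Key-range rule from the text: surnames A-M -> User_1, N-Z -> User_2.
    return "User_1" if surname[:1].upper() <= "M" else "User_2"

def insert_user(surname: str, email: str) -> None:
    # The application routes each write to the sub-table owning the key.
    conn.execute(f"INSERT INTO {table_for(surname)} VALUES (?, ?)",
                 (surname, email))

insert_user("Adams", "a@example.com")
insert_user("Zhang", "z@example.com")

# A keyed lookup touches only the one sub-table that can hold the key --
# the same partition elimination a DBMS optimiser performs natively.
hit = conn.execute(
    f"SELECT email FROM {table_for('Adams')} WHERE surname = ?",
    ("Adams",)).fetchone()
```

With built-in partitioning the `table_for` logic moves into the DDL and the application simply queries one logical table.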
Sharding is where key ranges are moved, not just to a different table, but to a whole different instance of the DBMS. The application decides which instance to connect to, depending on key range. This is a scale-out technique. Some DBMSes support this as a feature, such as the Spider storage engine for MariaDB. The obvious costs are the additional hardware to run the other instances, the duplication of reference data across all nodes to maintain RI, and the application complexity.
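The application-side half of this can be sketched as a key-range-to-instance map. The hostnames below are entirely hypothetical placeholders; in practice each entry would be a connection string for a separate DBMS instance.

```python
# Hypothetical shard map: each surname key range is owned by a whole
# DBMS instance rather than just a sub-table.
SHARD_MAP = [
    (("A", "M"), "db-shard-1.internal:3306"),
    (("N", "Z"), "db-shard-2.internal:3306"),
]

def shard_for(surname: str) -> str:
    """Return the DSN of the instance owning this key."""
    initial = surname[:1].upper()
    for (lo, hi), dsn in SHARD_MAP:
        if lo <= initial <= hi:
            return dsn
    raise ValueError(f"no shard owns key {surname!r}")
```

The application then opens its connection against `shard_for(key)` instead of a single fixed server, which is where most of the added complexity lives.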
The techniques may be combined, of course, so a table could be both horizontally and vertically partitioned.
General advantages of splitting:
- Indexes may become shallower, saving costs on index look-ups.
- Each partition can write to a separate disk, spreading IO load.
Disadvantages include:
- Application complexity
- It will be more difficult to ensure data integrity
- Upgrades become trickier.
- Purging data is incrementally more difficult to get right.
Basically, if you don't have a problem, and don't see one coming, don't bother. Horizontal partitioning using DBMS functionality is the only one worth considering as a pre-emptive measure. Even then it will require a re-write of 200GB. Do you have the disk and service window to accomplish this?
Context
StackExchange Database Administrators Q#125216, answer score: 5