Why is the default page size for databases so small?
Problem
In PostgreSQL and SQL Server the default page size is 8 KB, in MySQL it is 16 KB, and in IBM DB2 and Oracle it is only 4 KB.
Why are these page sizes so small?
Is there a historical or memory usage reason?
Solution
I'm running Linux (Fedora 34, 64 bit, two cores, four CPUs, 32 GB RAM - PostgreSQL 13.3).
If I run stat -f on some random file:

[pol@fedora inst]$ stat -f blah.txt
  File: "blah.txt"
    ID: f1b798b1610e7067 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 322411548   Free: 316122834   Available: 299727775
Inodes: Total: 81960960    Free: 81739842
[pol@fedora inst]$

Note: Block size: 4096 = 4096 bytes = 32768 bits.

Now, even if you have a file that is two bytes long ("Hi"), it will still occupy 4096 bytes on disk - that is basically the minimum I/O the OS can perform. The OS takes stuff off disk in 4K "chunks" and writes it back in 4K chunks - see here for a quick overview. You might like to test this on your own system.

The disk itself has its own "atomic" unit. With HDDs, this was normally 512 bytes, but see the link above - "and at the hardware level the old drives used 512B sectors while new devices often write data in larger chunks (often 4kB or even 8kB)". See here for HDDs and here for SSDs. (Thanks to @RonJohn for his comment.)
Likewise, the database reads and writes data in blocks (also called pages - the terminology can be confusing): if you change one solitary bit in a record, the database still has to read the entire page the record is on and write the entire page back to disk once the modification is complete.
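That read-modify-write cycle can be sketched in a few lines of Python (a toy model with an assumed 8 kB page size, not PostgreSQL's actual storage code):

```python
import os

PAGE_SIZE = 8192  # assume an 8 kB page, as in PostgreSQL's default

def flip_one_bit(path: str, byte_offset: int) -> tuple[int, int]:
    """Flip a single bit, but transfer whole pages, the way a database must.
    Returns (bytes_read, bytes_written) to show the I/O amplification."""
    page_start = (byte_offset // PAGE_SIZE) * PAGE_SIZE
    with open(path, "r+b") as f:
        f.seek(page_start)
        page = bytearray(f.read(PAGE_SIZE))      # read the ENTIRE page...
        page[byte_offset - page_start] ^= 0x01   # ...to change one bit...
        f.seek(page_start)
        f.write(page)                            # ...then write the ENTIRE page back
    return len(page), len(page)

# Demo: a 3-page file; touching one bit in the middle page moves 2 x 8 kB.
with open("demo.db", "wb") as f:
    f.write(b"\x00" * (3 * PAGE_SIZE))
read_n, wrote_n = flip_one_bit("demo.db", PAGE_SIZE + 100)
print(f"changed 1 bit; read {read_n} bytes, wrote {wrote_n} bytes")
os.remove("demo.db")
```

One changed bit costs a full page in and a full page out.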
On PostgreSQL, the default block size is 8K:

test_1=# SELECT name, setting, short_desc, extra_desc FROM pg_settings WHERE name LIKE '%block%' OR short_desc LIKE '%block%';
      name      | setting |                  short_desc                  | extra_desc
----------------+---------+----------------------------------------------+------------
 block_size     | 8192    | Shows the size of a disk block.              |
 wal_block_size | 8192    | Shows the block size in the write ahead log. |
(2 rows)

It's important that there not be too big a gap between the HDD, the OS and the RDBMS "atomic unit" sizes - otherwise, you run the risk of torn pages - from the link:
Avoiding Torn Pages
For the Postgres file layout, Postgres reads and writes data to disk
8kb at a time. Most operating systems make use of a smaller page size,
such as 4kb. If Postgres is running on one of these operating systems,
an interesting edge case can occur. Since Postgres writes to disk in
units of 8kb and the OS writes to disk in units of 4kb, if the power
went out at just the right time, it is possible that only 4kb of an
8kb write Postgres was performing were written to disk. This edge case
is sometimes referred to as “torn pages”. Postgres does have a way of
working around torn pages, but it does increase the amount of I/O
Postgres needs to perform.
Also, see here:
Partial Writes / Torn Pages
So what are full page writes about? As the comment in postgresql.conf
says it’s a way to recover from partial page writes – PostgreSQL uses
8kB pages (by default), but other parts of the stack use different
chunk sizes. Linux filesystems typically use 4kB pages (it’s possible
to use smaller pages, but 4kB is the max on x86), and at the hardware
level the old drives used 512B sectors while new devices often write
data in larger chunks (often 4kB or even 8kB).
So when PostgreSQL writes the 8kB page, the other layers of the
storage stack may break this into smaller chunks, managed separately.
This presents a problem regarding write atomicity. The 8kB PostgreSQL
page may be split into two 4kB filesystem pages, and then into 512B
sectors. Now, what if the server crashes (power failure, kernel bug,
…)?
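The failure mode both quotes describe can be simulated directly: write an 8 kB page with a checksum in its header, "crash" after only the first 4 kB OS page reaches disk, and the checksum no longer verifies. (A toy page layout with an invented 4-byte CRC header - PostgreSQL's real page format and checksum algorithm differ.)

```python
import zlib

DB_PAGE = 8192   # what the database writes as one unit
OS_PAGE = 4096   # what the OS / filesystem actually writes atomically

def make_page(payload: bytes) -> bytes:
    """Build a page: a 4-byte CRC32 header over the body, then the body."""
    body = payload.ljust(DB_PAGE - 4, b"\x00")
    return zlib.crc32(body).to_bytes(4, "little") + body

def page_is_valid(page: bytes) -> bool:
    return int.from_bytes(page[:4], "little") == zlib.crc32(page[4:])

old = make_page(b"A" * (DB_PAGE - 4))   # page contents before the update
new = make_page(b"B" * (DB_PAGE - 4))   # page contents being written

# Simulated crash: only the first OS-sized chunk of the new page made it to
# disk; the second half on disk still belongs to the old page -> "torn" page.
torn = new[:OS_PAGE] + old[OS_PAGE:]

print(page_is_valid(new))    # complete write verifies
print(page_is_valid(torn))   # torn page is detected as corrupt
```

This detection is exactly why PostgreSQL's full-page-writes workaround (mentioned in the first quote) costs extra I/O: detecting the tear is not enough, you also need an intact copy to recover from.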
As with so much related to computer science, it's a matter of trade-offs and compromises - here's a benchmark of PostgreSQL run against the same system just changing the block size - from the post:
Samsung SSD 840, 500 GB    TPS (txns/second)
blocksize=2k               147.9
blocksize=4k               141.7
blocksize=8k               133.9
blocksize=16k              127.2
blocksize=1MB               42.5

So, you can see that a naïve "make the db block size as big as possible" approach doesn't work very well. All I will say about this is that database benchmarks are a total quagmire... for some applications 1 MB may well be suitable, although straying beyond 16 KB would require considerable justification. Systems' default parameters are just that - defaults - chosen to be reasonably good under the widest range of circumstances.
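A rough illustration of one mechanism behind such numbers - not a reproduction of the benchmark, and nothing like real database I/O (no WAL, no cache, no concurrency) - is to time durably-flushed writes at different page sizes:

```python
import os
import tempfile
import time

TOTAL = 1024 * 1024  # move 1 MB in total, one fsync per "page"

def time_chunked_writes(chunk_size: int) -> float:
    """Write TOTAL bytes in chunk_size pieces, flushing each piece to disk."""
    buf = b"\x00" * chunk_size
    fd, path = tempfile.mkstemp()
    start = time.perf_counter()
    try:
        for _ in range(TOTAL // chunk_size):
            os.write(fd, buf)
            os.fsync(fd)         # pay the durability cost once per page
    finally:
        os.close(fd)
        os.unlink(path)
    return time.perf_counter() - start

for size in (2048, 8192, 65536):
    print(f"page={size:6d} B  elapsed={time_chunked_writes(size):.3f} s")
```

Smaller pages mean more syscalls and flushes for the same data; bigger pages mean more bytes moved for every tiny change - the trade-off the benchmark table is probing.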
Regarding the historical part of the question - yes, a lot of it relates to the era when disks came in 512-byte sectors. HDDs, despite advances in speed and incremental improvements in the composition of the rust, are essentially unchanged since the first drives - HDD performance has fallen way behind that of CPUs and RAM, even as capacity has increased enormously.
StackExchange Database Administrators Q#294587, answer score: 43