patternsqlMinor

Why do we need partitions in PostgresSQL? (or any other RDBMS)

Submitted by: @import:stackexchange-dba·Mar 10, 2026·

Viewed 0 times

whypostgressqlneedanypartitionsotherrdbms

Problem

So, I am trying to teach myself partitions in PostgreSQL. I understand that a database can become slow when a table hits millions+ of lines and indexes don't fit into memory anymore.

One thing that strikes me is how partitions with sub-partitions resemble a lot to some kind of manually managed indexes. What is wrong with regular indexes? Why can't them partition themselves?

I imagine this must be hard, judging by the fact that this problem seems to have been around for a while. Maybe the DB cannot know how we plan to access our data? If so, why isn't there some simple way to tell the DB how we want to access our data?

Why do we have to do the whole table inheritance dance when partitionning is such a common case?

EDIT: There seem to be some work in progress to support a nicer syntax, Cf. https://wiki.postgresql.org/wiki/Table_partitioning

Solution

A general goal in designing data storage and retrieval systems is that query time should scale with the amount of data being retrieved, not the amount of data that exists in total. Partitioning is a powerful tool for achieving that goal.

Consider a table of session data that looks something like:

create table clickstream (
  clickstream_id bigserial primary key
, created timestamp default now() not null
, user_agent text
, client_ip inet
, entry_page text
, session_id text
--and so on...
);

Obviously you would have an index on created which would allow you to perform queries on a particular date/time range of the data. If you get 10k sessions a day and you have 3 months of data online, querying for a day's worth of data works quite well. If you get 1M sessions a day and you have 4 years of data online, this works poorly.

But why? The nature of the I/O you're performing -- random reads -- doesn't scale well. The query engine has to find the location in the created B-tree that corresponds to the start of your range, and likewise the end of your range, and traverse every element of the tree in between, and read each row individually. Or rather, read each relevant block individually (since all I/O is at the block level), but many rows means many blocks, especially for wide tables.

What you would prefer to do is read all of those blocks sequentially, as in a full table scan. In fact what if you are doing this kind of operation over and over again, you'll find yourself creating a temporary table with just the day's data and querying that instead, as an optimization. This is the kind of optimization that partitioning let's the query engine do automatically.

Partition the table by day, and when you write (for example) select ... from clickstream where created >= current_date-1 and created < current_date the query engine will understand that this data falls entirely within one partition, and then compute that even if created is indexed on that partition that using said index is not worthwhile, and perform a much faster sequential scan.

Now, that's for an analytic workload. Partitioning can also be valuable in a transactional workload, but in more limited ways. Obviously if your transactional system is doing range scans in real-time then the exact same pattern can apply. But it can also be useful to partition on certain keys. For example if you have an orders table with a customer_id, and the nature of your business is such that most customers have many orders, and the orders table is always accessed using customer_id, you can divide the orders table into an arbitrary number of partitions, say 64, on a hash of customer_id. This has the effect of reducing the overall table and index size when performing a query. More of the rows in a particular block will be relevant, making read-ahead more effective. Partition-local B-trees will have fewer nodes, and therefore fewer blocks, and potentially even less depth.

To get back to the heart of your question, which if I understand it is "why do we have to do this at all?", the simple reality is that YOU always know more about your data than the RDBMS does. This is the same reason query planners sometimes make mistakes. Query planners can be supplied with better statistics, or given hints, but they're just not smart enough to wholesale restructure your data for you. That is a high-level pattern recognition task that your brain is better suited to than any piece of code. What's more, this kind of restructuring must be carefully considered and then implemented by hand. I wouldn't even trust an RDBMS to automatically create indexes for me, much less rebuild the physical layout of a multi-terabyte table!

(One final note, PostgreSQL in particular only sorta supports partitioning. It's much easier to implement all of the above in an RDBMS like Oracle that supports it as a first-class construct.)

Code Snippets

create table clickstream (
  clickstream_id bigserial primary key
, created timestamp default now() not null
, user_agent text
, client_ip inet
, entry_page text
, session_id text
--and so on...
);

Context

StackExchange Database Administrators Q#95925, answer score: 4

Revisions (0)

No revisions yet.