
Should I split this large table into three smaller tables?

Submitted by: @import:stackexchange-dba

Problem

I'm designing a PostgreSQL database that will contain information about photos uploaded by my users. All users will have at least one main photo and optionally one or more public and/or private photos. My first schema is as follows:

user
----
id (PK)

photo
-----
id (PK)
user_id   (FK to user)
photo_id  (unique identifier such as "aK1q9")
type      ("main" or "other")
access    ("public" or "private")


The query that will get run the most is:

SELECT p.photo_id FROM photo p INNER JOIN user u ON p.user_id = u.id WHERE p.type = 'main' AND u.id = (some user id);


The next most popular query will be:

SELECT p.photo_id FROM photo p INNER JOIN user u ON p.user_id = u.id WHERE p.type = 'other' AND p.access = 'public' AND u.id = (some user id);


The problem I foresee is that the photo table will become extremely large over time, as it will contain all public and private photos uploaded by all users. Since my most popular query will only be looking for main photo IDs, would it make more sense to break my photo table up into three tables?

main_photo
----------
id (PK)
user_id   (FK to user but in a one-to-one relationship to user)
photo_id

other_public_photo
------------------
id (PK)
user_id   (FK to user)
photo_id

other_private_photo
-------------------
id (PK)
user_id   (FK to user)
photo_id


I would think this latter schema would be preferable because: 1) each photo's type and access information is made explicit by where it is stored, eliminating the extra ANDs in my queries; and 2) my queries would run faster, since they would run against one of three smaller tables rather than one huge table. Which is the optimal approach from a performance standpoint?

Thanks.

Solution

@CoderAbsolute's answer gives you a good design for your tables. Since he or she did not go into detail about why this approach is better, I thought it was worth adding another answer.

First of all, design your table structure in accordance with how your data fits together. Don't try to smoosh different kinds of things into one table, and don't create several tables for the same kind of record. Try to 'normalize': if a table would have the same information repeated many times, move that information into a separate table and link to it from the first table using a foreign key. You should be aware of what first normal form, second normal form and third normal form are. Many real-world databases don't match these standards completely, but knowing what to aim for will help you produce a much cleaner design.
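As a concrete sketch of what the normalized single-table design from the question might look like in PostgreSQL (table and constraint names are my own choices; `users` is used instead of `user` because `user` is a reserved word):

```sql
CREATE TABLE users (
    id bigserial PRIMARY KEY
);

CREATE TABLE photo (
    id        bigserial PRIMARY KEY,
    user_id   bigint NOT NULL REFERENCES users (id),
    photo_id  text   NOT NULL UNIQUE,  -- external identifier such as "aK1q9"
    -- CHECK constraints make the allowed values explicit without
    -- splitting the table:
    type      text   NOT NULL CHECK (type   IN ('main', 'other')),
    access    text   NOT NULL CHECK (access IN ('public', 'private'))
);

-- Enforce "one main photo per user" with a partial unique index:
CREATE UNIQUE INDEX one_main_photo_per_user
    ON photo (user_id)
    WHERE type = 'main';
```

The CHECK constraints and the partial unique index give you the invariants the three-table design was trying to encode, while keeping all photos in one place.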

I would say, don't worry about optimization until you already have a correct design. First, you don't yet know how many rows your tables will hold. Second, database engines are designed to run queries as fast as possible, even over large tables. Don't second-guess the developers of your database software.

When you've figured out that you really do have a bottleneck, you should look first at indexing. We think of a database table as being something like an array. In reality, it may be stored as a bunch of 'blobs', one for each row, which can sit at different locations on disk, not necessarily in order. (This is not a super accurate technical description of DB internals, but it should give you a clearer picture of how the pieces fit together.) How does the database find all the blobs? It keeps, somewhere, a list of 'pointers' that tell it where to find each blob. So reading every single row of a table is an efficient process: it walks the list and retrieves each one.

Now, suppose that you most commonly retrieve all the photos for a given user. This might be a slow process, since the database has to go through every single row of the photo table and look at the user_id field. In this case, you should add an index on that column. An index is something like a lookup table that lets the software quickly find the location of every row with a given user_id, so it doesn't have to check each row in turn.
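In PostgreSQL terms, that index is a one-liner (the index name is my own choice):

```sql
-- A plain B-tree index; the planner can now jump straight to one
-- user's rows instead of scanning the whole photo table.
CREATE INDEX photo_user_id_idx ON photo (user_id);
```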

There are different kinds of indexes. Some are optimized for matching a particular value; this is probably the type you want for the user_id column. Some are optimized for finding values greater or smaller than some threshold, which might make sense for timestamps ("give me all the photos from the last month"). You can have an index on multiple columns if that is what you regularly query on, or even on some function of one or more columns. Some indexes store similar items close to each other on disk, to take advantage of caching and retrieve them more quickly.
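For the two queries in the question, a couple of plausible options (index names are mine, and the timestamp column is hypothetical, not part of the original schema):

```sql
-- A composite index matching both frequent queries, which filter on
-- user_id plus type (and sometimes access):
CREATE INDEX photo_user_type_access_idx
    ON photo (user_id, type, access);

-- Alternatively, a partial index covering only the rows the hottest
-- query ("the main photo for user X") ever touches:
CREATE INDEX photo_main_idx
    ON photo (user_id)
    WHERE type = 'main';

-- If you later add an upload timestamp, a range-friendly index would
-- suit "photos from the last month" queries, e.g.:
-- CREATE INDEX photo_uploaded_at_idx ON photo (uploaded_at);
```

A partial index like photo_main_idx stays small no matter how many 'other' photos accumulate, which directly addresses the question's worry about table growth.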

You should familiarize yourself with the possibilities for your DBMS. Experimentation can also be very valuable here, since the exact speedups depend on your settings and also your hardware configuration.
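Experimentation in PostgreSQL usually starts with EXPLAIN. A sketch (the user id 42 is a placeholder; timings and row counts will vary with your data and hardware):

```sql
-- ANALYZE actually runs the query and reports real timings;
-- BUFFERS shows how much data was read from cache vs. disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT photo_id
FROM photo
WHERE user_id = 42
  AND type = 'main';
```

If the plan shows an index scan rather than a sequential scan over photo, the index is doing its job for this query.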

Context

StackExchange Database Administrators Q#117807, answer score: 2
