HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMajor

What is the reason to "normalize your databases"?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
normalizetheyourdatabaseswhatreason

Problem

In the past, I've been told (on this site) that I should normalize the values in the database - using a lookup table instead of using direct (string) keys.

I am confused why this is so good that several people recommended this. Is it just for memory consumption? But then in my case (explained below) how much is that?

Consider I have a dictionary database for a website:

CREATE TABLE dictionary
(
    id serial NOT NULL,
    key text NOT NULL,
    language text NOT NULL,
    value text,
    PRIMARY KEY (id)
)


And then insertion would happen like:

INSERT INTO public.dictionary VALUES
    ('yes_button', 'en', 'yes'), ('yes_button', 'nl', 'ja')


Or instead of 'en' I'd use 'en-us'. Now I've been told to "normalize" the database - which would mean having a lookup table that binds the string representation of a language ('en', 'nl' to a value):

CREATE TABLE languages
(
    id serial NOT NULL,
    language text
)

CREATE TABLE dictionary
(
    id serial NOT NULL,
    key text NOT NULL,
    language integer NOT NULL,
    value text,
    PRIMARY KEY (id),
    FOREIGN KEY (language)
        REFERENCES  public.languages (id)
)


However, this would increase complexity quite a bit, since insertion can no longer be simple — it needs to either check the foreign table on the backend, or use some more complex SQL. So there is a real cost to updating to this design.

What are the advantages?

-
Is it just the storage size? An integer foreign key reference is 4 bytes anyways, and a size 2 string is 3 bytes (while a size 5 is 6 bytes - so save 2 bytes at most).

-
Is it speed of database? But isn't this then a micro-optimization, which is the "root of all evil"?

-
Is it just to make sure that each language "exists" before insertion? There are other mechanisms, and languages should be created on the fly anyway. Existence isn't based on what is in our database, but rather an external lookup to the standards, and the language is properly added once a user pr

Solution

Your issue is that you are getting two different pieces of advice conflated into one and the justifications for each piece of advice are not being presented clearly.

Recomendation 1: Normalize your database

In any transactional database this is generally considered a best practice. There are lots of reasons why you might back away from this and there are applications, like BI data warehouses, where this is not necessarily what you want. However, for a transactional database you normalize to begin with and denormalize when necessary as a rule of thumb.

Where there seems to be some confusion is around why to normalize. You are not alone in this confusion. A lot of people have a lot of misconceptions about the purpose of normalizing your database.

Normalization is NOT (primarily) about:

  • Increasing performance



  • Saving memory



  • Saving disk space



  • Reducing duplication (it is about reducing redundancy, but that is not exactly the same thing as duplication - more below)



Normalization IS about:

  • Data Quality (first and foremost)



Normalizing your database design means that you don't store a piece of data in multiple places so that if you need to update or delete it for any reason you might get yourself in trouble because the data will need to change in more than one place. When you don't normalize you end up with the very real possibility of your data becoming inconsistent over time as changes are made inconsistently. This is a little bit of an over simplification because there are other benefits of normalization, such as simplified query logic. However, data quality is by far the most important benefit and, ironically, it is the one which programmers without formal database design training most often fail to understand clearly.

Recommendation 2: Use Integer Surrogate Keys

This falls much more under the rubric of common custom than anything else. Lots of people like to use a meaningless integer primary key in all their relational database tables. This in itself is actually two pieces of advice: (i) use surrogate keys in every table and (ii) use integers for surrogate keys.

Different people will give you different reasons for why they consider these best practices and all of these reasons can be argued over on a case by case basis. The best argument I can think of for using surrogate keys is that natural keys are more likely to be changed. Changing any primary key is a giant pain so it's best to avoid if you can. The best argument I can think of for using integers for your surrogate key is that it's a nice simple data type which is compact and efficient. Again, this is highly situational so people will make an argument for or against this in different cases.

What I would say about Recommendation 2 overall is this: pick a lane and stick to it so that your code is relatively consistent and diverge from this only when you have a really compelling case of critical performance or critical efficiency and you can clearly demonstrate that diverging from your usual approach has significant benefits.

In general, I avoid using natural keys, but there are times when you can get away with it, and even times when it makes better sense to use them. The question you need to ask yourself with a natural key is "Might this change in the future?"

My rule of thumb is "if a user can see it, they're going to want to change it someday". In your specific case though, language codes are set by an international standards body, so the chance that they might change are pretty slim — it would be too big of a pain. I wouldn't hesitate in your case to use "en-us" as a key value.

Context

StackExchange Database Administrators Q#291639, answer score: 39

Revisions (0)

No revisions yet.