HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Normalize or Denormalize - Which would be better

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
normalizedenormalizebetterwouldwhich

Problem

I have the following data structures:

Regions ( About 150 Regions )
-------
id    Region_Name

Sub Regions (10 -12 for each region)
------------------------------------
id  Sub_Region_Name  Region_id

Showrooms ( 150 - 170 per Sub Region )
--------------------------------------
id  Showroom_Name  Sub_Region_id

Users ( 750 - 1,000 per Showroom)
-----------------------------------------
id  UserName Address  Age  Showroom_Id


What would be the best option for the Users table?

Should I have all the users in one table, which would take the number of rows to 20 million; or should I use a table structure like Users_Arizona, Users_Ohio etc. as I am most interested in users grouped by region.

The database is not open to public interaction.

Solution

First of all the data model must be accurate. It must capture the values required by the business in order to perform its functions and answer its queries.

In this particular case I believe you want to associate each user with one particular showroom. If my belief is correct then the model you show fulfils this requirement.

The second concern is to make the system fast. You seem worried that having between 160 million and 300 million rows in a single Users table will be unworkable. You propose to preemptively address this by having multiple Users tables, one per region, for about 150 Users_xxx tables in total. The rationale for splitting at this level is that the majority of the queries will aggregate by region.

I would suggest retaining a single user table, for DB maintenance and application development reasons.

It is unlikely the list of regions is permanently fixed. This means new regions will be introduced and current ones removed. In turn that means tables must be created and dropped, with application changes to match. What should be a simple data change has become a system maintenance project.

A typical application function is to find a user based on natural keys, surname for example. With a single table this is implemented as

select 
from Users
where surname = 'the-name'


Surname (and other columns used in queries) can be indexed to make this fast. Covering indexes can be defined to keep all the required data contiguous on disk. Range scans can be made fast with disk read-ahead etc.

With split tables it is

select 
from Users_AAA
where surname = 'the-name'

union all

select 
from Users_BBB
where surname = 'the-name'

union all

...
repeat 147 more times
...

select 
from Users_ZZZ
where surname = 'the-name'


Each table will have its own little bit of disk. These are unlikely to be contiguous. Every query will have to hit every table, even if they're indexed. It is unlikely to be fast. Once again this query must change every time the list of regions changes. It can be hidden from the application somewhat by enclosing it in a view, but the maintenance task remains.

As a starting point, I would maintain a single Users table with a foreign key to showroom, just as you've shown. If testing proved to be a problem in practice at the anticipated scale, on production-sized hardware, I'd consider propagating the key of Region into SubRegion and Showroom, giving them compound keys. Users table can then be indexed / clustered by Region to satisfy the given query. I may even use the natural key for this, rather than the surrogate, to remove further steps from the execution plan. The foreign key constraints will ensure only correct Region_Id appear in the child tables.

This could also be achieved by demnormalization. Region_Id column is added to Users but not to SubRegion or Showroom. The value is copied in when a row is written to Users. Assuming a user never switches Showroom/ Subregion / Region you're done. The system cannot help you ensure Users.Showroom_Id and Users.Region_Id correspond to each other.

Code Snippets

select <columns>
from Users
where surname = 'the-name'
select <columns>
from Users_AAA
where surname = 'the-name'

union all

select <columns>
from Users_BBB
where surname = 'the-name'

union all

...
repeat 147 more times
...

select <columns>
from Users_ZZZ
where surname = 'the-name'

Context

StackExchange Database Administrators Q#198873, answer score: 2

Revisions (0)

No revisions yet.