patternMinor
Any benefit in replacing a varchar for an index in the main table?
Viewed 0 times
themainbenefitanyvarcharreplacingforindextable
Problem
I'm 'inherithing' the job of a database admin for a very small team (3-4 people) and while I know how to execute SQL queries relatively well and understand the basics of programming efficiency , I still new in the realm of thinking about database architetures.
I noticed the previous admin, build this weird schema that I
I noticed the previous admin, build this weird schema that I
m still trying to figure out the benefits. Since he's more experienced than I in DB architecture there must be some positives to it. Unfortunately he's no longer around to ask for his motivations.
In essence, whenever there would be a character column whose values repeat often (say, the salesman_name in table of sales), this character column would be replaced by an index (salesman_id), functioning as a foreign key to another table. This second table would only contains the foreign key (salesman_id) and list of unique character values (salesman_name). There are no other tables that use this character column.
Why to do that?
I could understand this practice if there were other tables linking to salesman_name, so an update to salesman_name can be done in one single place. I could also understand in terms of saving disk space since less bytes are replicated every row. However our sales` table has "only" 5 million rows, and even in csv format it has less than a 1Gb in size. And while it saves on disk, it requires a Join everytime we want to look at the fully fledged sales table. This happens so often that the previous admin even set a view with this join already done. But why to separate solely to unite later? Is there any other obvious reason to separate repeating data that I'm missing?Solution
Most relational databases use paged storage. The unit of IO as far as the DBMS is concerned is 1 page. Each page will typically hold multiple rows. Each row fits entirely within a single page.
When a row is needed for a query the whole of its page is read in. All of the other rows on that page can be used for other queries without incurring further IO cost. So the more rows that fit on a page the less IO there will be and the faster the system runs.
There are many other technologies and implementations than that but the above covers many typical cases.
One way to fit more rows on a page is to make each row shorter. Replacing names (15-40 bytes?) with an int (4 bytes) will help with this. The int to string mapping will be a separate table. Since there are relatively few distinct string values this mapping table will be small. On balance there will be a net reduction in IO.
You can think of this as a form of dictionary compression. Arguably it is also more normalised as correcting a typo will update one column in one row avoiding the risk of inconsistent data. I'm less convinced that swapping a natural key for a surrogate key counts as normalisation.
Yes there must be joins to return a full sale row. The data to join is more likely to be in memory and DBMS are optimised for joins, so it's likely to be faster than that equivalent disk access.
If all the tables fit comfortably in RAM with enough left over for query execution then space minimisation isn't worth the effort. Perhaps when the system started out it ran on a 286 with a 500Mb disk when it was worthwhile? Perhaps it's a habit the designer carried from past experience and never thought to question.
When a row is needed for a query the whole of its page is read in. All of the other rows on that page can be used for other queries without incurring further IO cost. So the more rows that fit on a page the less IO there will be and the faster the system runs.
There are many other technologies and implementations than that but the above covers many typical cases.
One way to fit more rows on a page is to make each row shorter. Replacing names (15-40 bytes?) with an int (4 bytes) will help with this. The int to string mapping will be a separate table. Since there are relatively few distinct string values this mapping table will be small. On balance there will be a net reduction in IO.
You can think of this as a form of dictionary compression. Arguably it is also more normalised as correcting a typo will update one column in one row avoiding the risk of inconsistent data. I'm less convinced that swapping a natural key for a surrogate key counts as normalisation.
Yes there must be joins to return a full sale row. The data to join is more likely to be in memory and DBMS are optimised for joins, so it's likely to be faster than that equivalent disk access.
If all the tables fit comfortably in RAM with enough left over for query execution then space minimisation isn't worth the effort. Perhaps when the system started out it ran on a 286 with a 500Mb disk when it was worthwhile? Perhaps it's a habit the designer carried from past experience and never thought to question.
Context
StackExchange Database Administrators Q#320782, answer score: 3
Revisions (0)
No revisions yet.