patternsqlMinor
Choosing a PK: VARCHAR(2) or SMALLINT(2)?
Viewed 0 times
smallintchoosingvarchar
Problem
Today we were discussing the following:
Brazil has 27 states and each one has its own abbreviation (just like USA).
So we've got
One of our programmers proposed that we should use those abbreviations (
Extrapolating the use of our database, I countered his argument saying that if we were to - someday - expand our services to other countries, we would have a problem with repeated abbreviations, for example: in the USA there is
Now, supposing that we do not expand our services to other countries and stay only in Brazil I started considering the idea of using a VARCHAR(2) column as PK. In this scenario it doesn't sound like a totally bad idea.
Is it? Why? In which cases this could be applied? Should memory be considered in order to choose from one to another?
Brazil has 27 states and each one has its own abbreviation (just like USA).
So we've got
RJ for Rio de Janeiro, SP for São Paulo, MG for Minas Gerais and so on.One of our programmers proposed that we should use those abbreviations (
RJ, SP, MG, etc.) as PK on the States table that we are planning to add to a new project.Extrapolating the use of our database, I countered his argument saying that if we were to - someday - expand our services to other countries, we would have a problem with repeated abbreviations, for example: in the USA there is
SC for South Carolina and on Brazil we've got SC for Santa Catarina; the same happens for MT, PA and MA. Counting on this we've agreed that there should be a ID column as PK IDENTITY.Now, supposing that we do not expand our services to other countries and stay only in Brazil I started considering the idea of using a VARCHAR(2) column as PK. In this scenario it doesn't sound like a totally bad idea.
Is it? Why? In which cases this could be applied? Should memory be considered in order to choose from one to another?
Solution
In most cases I agree with @Aaron's approach, but this just happens to be one of the few instances of there being a truly natural key to use: ISO codes (ISO 3166 specifically). For those who are unfamiliar with ISO, they are, in their own words (from the main http://www.iso.org/ page):
We're ISO, the International Organization for Standardization. We develop and publish International Standards.
ISO 3166-1 describes the country codes. While there are a few choices of codes to use (2 character, 3 character, and numeric), the 2 character codes are the recommended choice and the most widely used (including for most country-based top-level domain names).
ISO 3166-2 describes each country's subdivisions (e.g. states). This standard is divided into country-based sections (e.g. ISO 3166-2:BR for Brazil) and the codes are 1, 2, or 3 alphanumeric characters.
So, you could do something like:
The idea here is that you would place the
While this method, using 3 - 5 bytes depending on the Country / Subdvision, is not as compact as using a 2-byte
Please note that I specified a
Please also note that I added the
UPDATE
Now, supposing that we do not expand our services to other countries and stay only in Brazil I started considering the idea of using a VARCHAR(2) column as PK. In this scenario it doesn't sound like a totally bad idea.
Just to have this stated, if using this method (i.e. ISO codes) to denote "States" / "Provinces" and possibly Countries, then you do technically have the ability to start out handling only "States" for a single country. In this configuration you would simply remove the
Now, if you later find that you need to expand the system to handle other countries, you can, at that time, do the following steps (none of which change any existing
We're ISO, the International Organization for Standardization. We develop and publish International Standards.
ISO 3166-1 describes the country codes. While there are a few choices of codes to use (2 character, 3 character, and numeric), the 2 character codes are the recommended choice and the most widely used (including for most country-based top-level domain names).
ISO 3166-2 describes each country's subdivisions (e.g. states). This standard is divided into country-based sections (e.g. ISO 3166-2:BR for Brazil) and the codes are 1, 2, or 3 alphanumeric characters.
So, you could do something like:
CREATE TABLE dbo.Country
(
CountryCode CHAR(2) NOT NULL
COLLATE Latin1_General_100_BIN2
PRIMARY KEY,
CountryName VARCHAR(50) NOT NULL
);
CREATE TABLE dbo.CountrySubdvision
(
CountrySubdvisionCode VARCHAR(3) NOT NULL
COLLATE Latin1_General_100_BIN2,
CountryCode CHAR(2) NOT NULL
COLLATE Latin1_General_100_BIN2
CONSTRAINT [FK_CountrySubdvision_Country]
FOREIGN KEY
REFERENCES dbo.Country(CountryCode)
ON UPDATE CASCADE,
CountrySubdvisionName VARCHAR(50) NOT NULL,
-- LocalizedSubdvisionName NVARCHAR(50) NOT NULL, -- ??
CONSTRAINT [PK_CountrySubdvision]
PRIMARY KEY (CountryCode, CountrySubdvisionCode)
);
The idea here is that you would place the
CountryCode and CountrySubdvisionCode fields in any tables that needed the "state" value, and FK back to dbo.CountrySubdvision on (CountryCode, CountrySubdvisionCode).While this method, using 3 - 5 bytes depending on the Country / Subdvision, is not as compact as using a 2-byte
SMALLINT, it does have the advantage of placing human-readable / meaningful values in those related tables. This could easily reduce some number of JOINs (when you only need the 2-character code anyway) and can (at least slightly) reduce time spent debugging certain problems.Please note that I specified a
_BIN2 (i.e. binary) collation for both of the "Code" fields to assist performance. Even with an extra couple of bytes, that should be nearly as fast as comparing two SMALLINT values. The only downside there is that you need to be consistent in adding codes in all upper-case and people need to remember that they need to use all upper-case when filtering on these codes.Please also note that I added the
ON UPDATE CASCADE clause to the FK on dbo.CountrySubdvision to handle the off-chance that a CountryCode is changed by ISO (extremely unlikely). Along those same lines, the FKs created on tables where the CountryCode and CountrySubdvisionCode fields have been placed will also need to be set to ON UPDATE CASCADE. This will propagate changes made to CountryCode in dbo.Country to the tables that have both of the fields. This also will propagate any changes made to the CountrySubdvisionCode to those related tables. This is also highly unlikely to happen (especially for where most of our systems would be dealing with) though more likely than any changes being made to the CountryCode.UPDATE
Now, supposing that we do not expand our services to other countries and stay only in Brazil I started considering the idea of using a VARCHAR(2) column as PK. In this scenario it doesn't sound like a totally bad idea.
Just to have this stated, if using this method (i.e. ISO codes) to denote "States" / "Provinces" and possibly Countries, then you do technically have the ability to start out handling only "States" for a single country. In this configuration you would simply remove the
dbo.Country table and the CountryCode field (and associated FK) from dbo.CountrySubdvision. Then you would only have CountrySubdvisionCode to place in any related tables.Now, if you later find that you need to expand the system to handle other countries, you can, at that time, do the following steps (none of which change any existing
CountrySubdvisionCode values in any of the related tables):- Create the
dbo.Countrytable
- Populate the
dbo.Countrytable
- Drop any FKs from related tables that reference
dbo.CountrySubdvision
- Drop the PK on
dbo.CountrySubdvision
- Add the
CountryCodefield todbo.CountrySubdvision, making itNOT NULLand with aDEFAULTof the CountryCode for the only country that you have been using so far (thisDEFAULTcan be removed at the end of this process, if you like)
- Recreate the PK on
dbo.CountrySubdvisionto be on(CountryCode, CountrySubdvisionCode)
- Create the FK on
dbo.CountrySubdvision
Context
StackExchange Database Administrators Q#125198, answer score: 8
Revisions (0)
No revisions yet.