HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

What problems are solved by splitting street addresses into individual columns?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
individualwhatcolumnsareintosplittingstreetaddressessolvedproblems

Problem

We have a team who designs the tables and relations for software developers. In our organization, they are pretty strict about enforcing 3NF normalization - which to be honest, I agree with given the size of our organization and how the needs or our clients change over time. There is only one area I'm not clear about the reasons behind their design decision: addresses.

While this mostly focuses on addresses in the United States, I think this could apply to any country that does this. Each piece of an address gets its own column in the addresses table. For instance, take this gnarly U.S. address:

Attn: Jane Doe
485 1/2 N Smith St SW, APT 300B
Chicago, IL 11111-2222


It would get split up in the database like this:

  • Street number: 485



  • Street fraction: 1/2



  • Street pre-directional: N (North)



  • Street name: Smith



  • Street type: ST (Street)



  • Street post-directional: SW (Southwest)



  • City: Chicago



  • State: IL (Illinois)



  • Zip code: 11111



  • Zip4 Code: 2222



  • Country (assumed to be U.S.A.)



  • Attention: Jane Doe



  • P.O. Box: NULL



  • Dwelling type: APT (Apartment)



  • Dwelling number: 300B



And there would be a few other columns related to rural routes and contract routes. Furthermore, our specific application will likely have a few international addresses in it. The data modelers said they would add columns specific for international addresses, which would be the normal line 1, line 2 fields.

At first I thought this was WAY overboard. Researching online repeatedly refers to using address line 1, 2, 3 and possibly 4, then splitting out city, region and postal code. We do have one use case for our new application where this granularity is beneficial. We have to validate that the user is not creating a duplicate business, and checking the address is one of the validations. We can get it to work with address line 1 and 2, but it would be more difficult.

As for our specific application, we need to store multiple kinds of addresses for businesses and peo

Solution

I spent 7 years developing software for a publishing company and one of the hardest problems we ever tackled was parsing street addresses in subscription lists. It is useful to split up addresses into distinct fields, but you can never, EVER design for every possible pathological aberration of address formats and components the human brain can devise.

Every locality can have its quirks, and that's just in the US. Throw in other countries and things get unmanageable very quickly for any approach that wants to parse every address. Just two examples:

In Spain, the street number always comes after the street name and a comma, and many addresses contain a floor number ordinal, such as 1° or 3ª, along with abbreviations for "left" ("Izda" meaning left-hand-door after you get up the stairs), "right" ("Dcha") or other possibilities. Now multiply that quirkiness by the number of different countries and areas with different historical customs for addresses... (Japan? Rural England? Korea? China?)

In Portland, OR, there are N-S and E-W axes that divide the city into NW, NE, SW and SE quadrants (as well as a N "quadrant", but I digress). N-S streets are numbered incrementally East and West from this axis, and addresses on E-W streets are dictated by the N-S street number being the "hundred block" of the number (i.e. a house on an E-W street between 11th and 12th avenues would have a number like 1123). Pretty standard stuff for US addresses.

Every so often you run into a Portland address like 0205 SW Nebraska St. A leading zero? WTF? There goes my integer column for house "number".

When the grid was set up, the N-S axis was defined by the Willamette river. Everything to the East of the river was NE or SE, and West of the river NW or SW. As the city grew south they ran into the inconvenient fact that the river meanders to the East, so projecting the axis South you have this problematic area that's on the "West" side of the river but East of the axis. The solution was to add a leading zero, in effect a minus sign, with the numbers incrementing towards the East from the axis line.

If I were you I'd give up hope of designing the ultimate system. You cannot cover all possibilities, and new ones will be created as humanity pushes into previously undeveloped land.

For US addresses, take a look at what the USPS has already done in address standardization, and remember to make the house_number column a varchar. While you're at it figure out how you're going to parse 1634 E N Fort Lane Ave.

For the rest of the world, I'd probably try to abstract additional fields to cover 80-90% of what is likely to come up, and provide a set of uninterpreted fields that can handle everything else when necessary. I.e. if your parser fails to handle an address, save it unparsed and flagged as such. If you do manage to parse an address, make sure you remember the order in which you found the various fields so you can reassemble it into something deliverable.

I was going to say that the most important field is going to be post code, but even that is not a given in many places.

Good luck. This can be a fun and extremely frustrating endeavor but the key to sanity is to know when to quit trying and just store the input unparsed, or partially parsed with the original input as backup.

Context

StackExchange Database Administrators Q#133546, answer score: 18

Revisions (0)

No revisions yet.