HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Designing Database for Longitudinal Survey Data

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
surveydesigningdatabaselongitudinalfordata

Problem

I'm trying to design a relational database based on a bunch of CSV data that I already have. It's mostly for the purposes of organization and querying subsets, and I won't be adding any new records via transactions. My data describes around 60 attributes of businesses, and it is longitudinal as they are surveyed each year (25 years).

My problem is that some attributes will see many duplicates over the life of a business, and others will see very few. Here's a dumbed down example:

id, year, company_name, type, num_employees, total_sales,
056, 2000, papa johns, fast food, 11, 45000
056, 2001, papa johns, fast food, 11, 45557
056, 2002, papa johns, fast food, 14, 50000
056, 2003, papa johns, pizza, 17, 55000
056, 2004, papa johns, pizza, 17, 55456
063, 1998, pops barbershop, barber shops, 3, 15000
063, 1999, fresh cutz, barber shops, 3, 15023
063, 2000, fresh cutz, barber shops, 3, 15000


I'm looking at around 60 million unique ids, most with a good number of years as well. If the database was in a form similar to this it would be massive and full of duplicates.

Any suggestions on a manner of organizing it?

Solution

According to Dr. E. F. Codd, the originator of the Relational Model of Data:

Duplicate values are permitted within columns of a relation, but duplicate rows are prohibited.1

Therefore, at the logical level —and taking into account the stated intention to create a relational database—, the column values duplicates presented in your sample table are not harmful.

If the relevant database has to contain large volumes of information, then it consequently must be managed as such; however, there are some aspects that can be optimized conceptually, logically, physically and pragmatically speaking.
Reorganization suggestions
Expository IDEF1X diagram

Having the referred sample table and data as a reference, it appears that the combination of the columns (year, company_name) is sufficient in ensuring row uniqueness (so it should be declared as a KEY, be it PRIMARY or ALTERNATE); therefore, the column labeled as id would be superfluous (assuming that it is a special addendum meant to contain surrogate key2 values, which lack business meaning, and, quite probably, would require handling an extra INDEX at the physical level).

Nevertheless, company_name and (company_?) type seem to be columns that hold data that is physically “heavy” (in terms of bytes, perhaps of type CHAR(n) or VARCHAR(n)), so I consider convenient to make some structural reorganizations, as depicted in the IDEF1X3 diagram contained in Figure 1:

Sample SQL-DDL logical-level design

Then, in accordance with said IDEF1X diagram, we can derive the following logical-level DDL layout:

-- As these are only illustrative examples, 
-- you should determine which are the most fitting 
-- data types and sizes for all the table columns 
-- depending on your business context characteristics.

-- Also, you should make accurate tests to define the most
-- convenient INDEX strategies.

CREATE TABLE company (
    company_id       INT      NOT NULL,
    name             CHAR(30) NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT company_PK PRIMARY KEY (company_id),
    CONSTRAINT company_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE year ( -- “Look-up” table.
    year             SMALLINT NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT year_PK PRIMARY KEY (year)
);

CREATE TABLE company_type ( -- “Look-up” table.
    company_type_code CHAR(2)  NOT NULL,
    name              CHAR(30) NOT NULL,
    --
    CONSTRAINT company_type_PK PRIMARY KEY (company_type_code),
    CONSTRAINT company_type_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE organizational_register (
    company_id          INT           NOT NULL,
    year                SMALLINT      NOT NULL,
    company_type_code   CHAR(2)       NOT NULL,
    number_of_employees INT           NOT NULL,
    total_sales         PERTINENTTYPE NOT NULL
    created_datetime    DATETIME      NOT NULL,
    --
    CONSTRAINT organizational_register_PK                 PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
    CONSTRAINT organizational_register_TO_company_FK      FOREIGN KEY (company_id)
        REFERENCES company (company_id),
    CONSTRAINT organizational_register_TO_year_FK         FOREIGN KEY (year)
        REFERENCES year    (year),
    CONSTRAINT organizational_register_TO_company_type_FK FOREIGN KEY (company_type_code)
        REFERENCES company (company_type_code)
);


Company

The company_id column, of type INT (or something similar, depending on the platform of choice), would decidedly be much “lighter” than company.name (helping to, e.g., speed up data retrieval) that is why I added and fixed it as the PRIMARY KEY (PK) of this table.

The name of each company would only be retained once in the company.name column, optimizing disk space usage if that is one of the (physical) concerns. Said column should be established as an ALTERNATE KEY (AK) by means of UNIQUE and NOT NULL constraints, which prevents the duplication of the same name value (thus, protecting data integrity and consistency).

Company Type

The company_type table would carry out a “look-up” role.

Its PK, called company_type_code, being of type CHAR(2) can retain values with business meaning and is far smaller than the column that encloses the full company_type.name, points that enhance data retrieval quickness and maintain readability (quite useful when, e.g., the end-users are interpreting result sets, the data is analyzed with a reporting tool, at the application program code debugging phase, etc.).

This table may hold the data shown below:
+-—————————————————-+-————————————-+
| company_type_code | name |
+-—————————————————-+-————————————-+
| FF | Fast food |
+-------------------+--------------+
| PI | Pizza |
+-------------------+--------------+
| BS | Barber shops |
+-------------------+--------------+

Since the company_type.name values would be ke

Code Snippets

-- As these are only illustrative examples, 
-- you should determine which are the most fitting 
-- data types and sizes for all the table columns 
-- depending on your business context characteristics.

-- Also, you should make accurate tests to define the most
-- convenient INDEX strategies.

CREATE TABLE company (
    company_id       INT      NOT NULL,
    name             CHAR(30) NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT company_PK PRIMARY KEY (company_id),
    CONSTRAINT company_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE year ( -- “Look-up” table.
    year             SMALLINT NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT year_PK PRIMARY KEY (year)
);

CREATE TABLE company_type ( -- “Look-up” table.
    company_type_code CHAR(2)  NOT NULL,
    name              CHAR(30) NOT NULL,
    --
    CONSTRAINT company_type_PK PRIMARY KEY (company_type_code),
    CONSTRAINT company_type_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE organizational_register (
    company_id          INT           NOT NULL,
    year                SMALLINT      NOT NULL,
    company_type_code   CHAR(2)       NOT NULL,
    number_of_employees INT           NOT NULL,
    total_sales         PERTINENTTYPE NOT NULL
    created_datetime    DATETIME      NOT NULL,
    --
    CONSTRAINT organizational_register_PK                 PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
    CONSTRAINT organizational_register_TO_company_FK      FOREIGN KEY (company_id)
        REFERENCES company (company_id),
    CONSTRAINT organizational_register_TO_year_FK         FOREIGN KEY (year)
        REFERENCES year    (year),
    CONSTRAINT organizational_register_TO_company_type_FK FOREIGN KEY (company_type_code)
        REFERENCES company (company_type_code)
);
CREATE TABLE company (
    company_id       INT      NOT NULL,
    name             CHAR(30) NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT company_PK PRIMARY KEY (company_id),
    CONSTRAINT company_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE year (
    year             SMALLINT NOT NULL,
    created_datetime DATETIME NOT NULL,
    --
    CONSTRAINT year_PK PRIMARY KEY (year)
);

CREATE TABLE sales (
    company_id       INT           NOT NULL,
    year             SMALLINT      NOT NULL,
    total_sales      PERTINENTTYPE NOT NULL
    column_x         FOO           NOT NULL,
    column_y         BAR           NOT NULL,
    created_datetime DATETIME      NOT NULL,
    --
    CONSTRAINT yearly_sales_PK     PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
    CONSTRAINT sales_TO_company_FK FOREIGN KEY (company_id)
        REFERENCES company (company_id),
    CONSTRAINT sales_TO_year_FK    FOREIGN KEY (year)
        REFERENCES year    (year),
);

CREATE TABLE company_type (
    company_type_code CHAR(2)  NOT NULL,
    name              CHAR(30) NOT NULL,
    --
    CONSTRAINT company_type_PK PRIMARY KEY (company_type_code),
    CONSTRAINT company_type_AK UNIQUE      (name) -- ALTERNATE KEY.
);

CREATE TABLE organizational_register (
    company_id          INT      NOT NULL,
    year                SMALLINT NOT NULL,
    company_type_code   CHAR(2)  NOT NULL,
    number_of_employees INT      NOT NULL,
    created_datetime    DATETIME NOT NULL,
    --
    CONSTRAINT organizational_register_PK                 PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
    CONSTRAINT organizational_register_TO_company_FK      FOREIGN KEY (company_id)
        REFERENCES company (company_id),
    CONSTRAINT organizational_register_TO_year_FK         FOREIGN KEY (year)
        REFERENCES year    (year),
    CONSTRAINT organizational_register_TO_company_type_FK FOREIGN KEY (company_type_code)
        REFERENCES company (company_type_code)
);

Context

StackExchange Database Administrators Q#152791, answer score: 9

Revisions (0)

No revisions yet.