patternMinor
Designing Database for Longitudinal Survey Data
Viewed 0 times
surveydesigningdatabaselongitudinalfordata
Problem
I'm trying to design a relational database based on a bunch of CSV data that I already have. It's mostly for the purposes of organization and querying subsets, and I won't be adding any new records via transactions. My data describes around 60 attributes of businesses, and it is longitudinal as they are surveyed each year (25 years).
My problem is that some attributes will see many duplicates over the life of a business, and others will see very few. Here's a dumbed down example:
I'm looking at around 60 million unique ids, most with a good number of years as well. If the database was in a form similar to this it would be massive and full of duplicates.
Any suggestions on a manner of organizing it?
My problem is that some attributes will see many duplicates over the life of a business, and others will see very few. Here's a dumbed down example:
id, year, company_name, type, num_employees, total_sales,
056, 2000, papa johns, fast food, 11, 45000
056, 2001, papa johns, fast food, 11, 45557
056, 2002, papa johns, fast food, 14, 50000
056, 2003, papa johns, pizza, 17, 55000
056, 2004, papa johns, pizza, 17, 55456
063, 1998, pops barbershop, barber shops, 3, 15000
063, 1999, fresh cutz, barber shops, 3, 15023
063, 2000, fresh cutz, barber shops, 3, 15000I'm looking at around 60 million unique ids, most with a good number of years as well. If the database was in a form similar to this it would be massive and full of duplicates.
Any suggestions on a manner of organizing it?
Solution
According to Dr. E. F. Codd, the originator of the Relational Model of Data:
Duplicate values are permitted within columns of a relation, but duplicate rows are prohibited.1
Therefore, at the logical level —and taking into account the stated intention to create a relational database—, the column values duplicates presented in your sample table are not harmful.
If the relevant database has to contain large volumes of information, then it consequently must be managed as such; however, there are some aspects that can be optimized conceptually, logically, physically and pragmatically speaking.
Reorganization suggestions
Expository IDEF1X diagram
Having the referred sample table and data as a reference, it appears that the combination of the columns
Nevertheless,
Sample SQL-DDL logical-level design
Then, in accordance with said IDEF1X diagram, we can derive the following logical-level DDL layout:
Company
The
The name of each company would only be retained once in the
Company Type
The
Its PK, called
This table may hold the data shown below:
+-—————————————————-+-————————————-+
| company_type_code | name |
+-—————————————————-+-————————————-+
| FF | Fast food |
+-------------------+--------------+
| PI | Pizza |
+-------------------+--------------+
| BS | Barber shops |
+-------------------+--------------+
Since the
Duplicate values are permitted within columns of a relation, but duplicate rows are prohibited.1
Therefore, at the logical level —and taking into account the stated intention to create a relational database—, the column values duplicates presented in your sample table are not harmful.
If the relevant database has to contain large volumes of information, then it consequently must be managed as such; however, there are some aspects that can be optimized conceptually, logically, physically and pragmatically speaking.
Reorganization suggestions
Expository IDEF1X diagram
Having the referred sample table and data as a reference, it appears that the combination of the columns
(year, company_name) is sufficient in ensuring row uniqueness (so it should be declared as a KEY, be it PRIMARY or ALTERNATE); therefore, the column labeled as id would be superfluous (assuming that it is a special addendum meant to contain surrogate key2 values, which lack business meaning, and, quite probably, would require handling an extra INDEX at the physical level).Nevertheless,
company_name and (company_?) type seem to be columns that hold data that is physically “heavy” (in terms of bytes, perhaps of type CHAR(n) or VARCHAR(n)), so I consider convenient to make some structural reorganizations, as depicted in the IDEF1X3 diagram contained in Figure 1:Sample SQL-DDL logical-level design
Then, in accordance with said IDEF1X diagram, we can derive the following logical-level DDL layout:
-- As these are only illustrative examples,
-- you should determine which are the most fitting
-- data types and sizes for all the table columns
-- depending on your business context characteristics.
-- Also, you should make accurate tests to define the most
-- convenient INDEX strategies.
CREATE TABLE company (
company_id INT NOT NULL,
name CHAR(30) NOT NULL,
created_datetime DATETIME NOT NULL,
--
CONSTRAINT company_PK PRIMARY KEY (company_id),
CONSTRAINT company_AK UNIQUE (name) -- ALTERNATE KEY.
);
CREATE TABLE year ( -- “Look-up” table.
year SMALLINT NOT NULL,
created_datetime DATETIME NOT NULL,
--
CONSTRAINT year_PK PRIMARY KEY (year)
);
CREATE TABLE company_type ( -- “Look-up” table.
company_type_code CHAR(2) NOT NULL,
name CHAR(30) NOT NULL,
--
CONSTRAINT company_type_PK PRIMARY KEY (company_type_code),
CONSTRAINT company_type_AK UNIQUE (name) -- ALTERNATE KEY.
);
CREATE TABLE organizational_register (
company_id INT NOT NULL,
year SMALLINT NOT NULL,
company_type_code CHAR(2) NOT NULL,
number_of_employees INT NOT NULL,
total_sales PERTINENTTYPE NOT NULL
created_datetime DATETIME NOT NULL,
--
CONSTRAINT organizational_register_PK PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
CONSTRAINT organizational_register_TO_company_FK FOREIGN KEY (company_id)
REFERENCES company (company_id),
CONSTRAINT organizational_register_TO_year_FK FOREIGN KEY (year)
REFERENCES year (year),
CONSTRAINT organizational_register_TO_company_type_FK FOREIGN KEY (company_type_code)
REFERENCES company (company_type_code)
);Company
The
company_id column, of type INT (or something similar, depending on the platform of choice), would decidedly be much “lighter” than company.name (helping to, e.g., speed up data retrieval) that is why I added and fixed it as the PRIMARY KEY (PK) of this table.The name of each company would only be retained once in the
company.name column, optimizing disk space usage if that is one of the (physical) concerns. Said column should be established as an ALTERNATE KEY (AK) by means of UNIQUE and NOT NULL constraints, which prevents the duplication of the same name value (thus, protecting data integrity and consistency).Company Type
The
company_type table would carry out a “look-up” role.Its PK, called
company_type_code, being of type CHAR(2) can retain values with business meaning and is far smaller than the column that encloses the full company_type.name, points that enhance data retrieval quickness and maintain readability (quite useful when, e.g., the end-users are interpreting result sets, the data is analyzed with a reporting tool, at the application program code debugging phase, etc.).This table may hold the data shown below:
+-—————————————————-+-————————————-+
| company_type_code | name |
+-—————————————————-+-————————————-+
| FF | Fast food |
+-------------------+--------------+
| PI | Pizza |
+-------------------+--------------+
| BS | Barber shops |
+-------------------+--------------+
Since the
company_type.name values would be keCode Snippets
-- As these are only illustrative examples,
-- you should determine which are the most fitting
-- data types and sizes for all the table columns
-- depending on your business context characteristics.
-- Also, you should make accurate tests to define the most
-- convenient INDEX strategies.
CREATE TABLE company (
company_id INT NOT NULL,
name CHAR(30) NOT NULL,
created_datetime DATETIME NOT NULL,
--
CONSTRAINT company_PK PRIMARY KEY (company_id),
CONSTRAINT company_AK UNIQUE (name) -- ALTERNATE KEY.
);
CREATE TABLE year ( -- “Look-up” table.
year SMALLINT NOT NULL,
created_datetime DATETIME NOT NULL,
--
CONSTRAINT year_PK PRIMARY KEY (year)
);
CREATE TABLE company_type ( -- “Look-up” table.
company_type_code CHAR(2) NOT NULL,
name CHAR(30) NOT NULL,
--
CONSTRAINT company_type_PK PRIMARY KEY (company_type_code),
CONSTRAINT company_type_AK UNIQUE (name) -- ALTERNATE KEY.
);
CREATE TABLE organizational_register (
company_id INT NOT NULL,
year SMALLINT NOT NULL,
company_type_code CHAR(2) NOT NULL,
number_of_employees INT NOT NULL,
total_sales PERTINENTTYPE NOT NULL
created_datetime DATETIME NOT NULL,
--
CONSTRAINT organizational_register_PK PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
CONSTRAINT organizational_register_TO_company_FK FOREIGN KEY (company_id)
REFERENCES company (company_id),
CONSTRAINT organizational_register_TO_year_FK FOREIGN KEY (year)
REFERENCES year (year),
CONSTRAINT organizational_register_TO_company_type_FK FOREIGN KEY (company_type_code)
REFERENCES company (company_type_code)
);CREATE TABLE company (
company_id INT NOT NULL,
name CHAR(30) NOT NULL,
created_datetime DATETIME NOT NULL,
--
CONSTRAINT company_PK PRIMARY KEY (company_id),
CONSTRAINT company_AK UNIQUE (name) -- ALTERNATE KEY.
);
CREATE TABLE year (
year SMALLINT NOT NULL,
created_datetime DATETIME NOT NULL,
--
CONSTRAINT year_PK PRIMARY KEY (year)
);
CREATE TABLE sales (
company_id INT NOT NULL,
year SMALLINT NOT NULL,
total_sales PERTINENTTYPE NOT NULL
column_x FOO NOT NULL,
column_y BAR NOT NULL,
created_datetime DATETIME NOT NULL,
--
CONSTRAINT yearly_sales_PK PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
CONSTRAINT sales_TO_company_FK FOREIGN KEY (company_id)
REFERENCES company (company_id),
CONSTRAINT sales_TO_year_FK FOREIGN KEY (year)
REFERENCES year (year),
);
CREATE TABLE company_type (
company_type_code CHAR(2) NOT NULL,
name CHAR(30) NOT NULL,
--
CONSTRAINT company_type_PK PRIMARY KEY (company_type_code),
CONSTRAINT company_type_AK UNIQUE (name) -- ALTERNATE KEY.
);
CREATE TABLE organizational_register (
company_id INT NOT NULL,
year SMALLINT NOT NULL,
company_type_code CHAR(2) NOT NULL,
number_of_employees INT NOT NULL,
created_datetime DATETIME NOT NULL,
--
CONSTRAINT organizational_register_PK PRIMARY KEY (company_id, year), -- Composite PRIMARY KEY.
CONSTRAINT organizational_register_TO_company_FK FOREIGN KEY (company_id)
REFERENCES company (company_id),
CONSTRAINT organizational_register_TO_year_FK FOREIGN KEY (year)
REFERENCES year (year),
CONSTRAINT organizational_register_TO_company_type_FK FOREIGN KEY (company_type_code)
REFERENCES company (company_type_code)
);Context
StackExchange Database Administrators Q#152791, answer score: 9
Revisions (0)
No revisions yet.