patternsqlMinor
Normalizing/combing mutliple tables with similar data
Viewed 0 times
tablescombingwithdatamutliplenormalizingsimilar
Problem
I'm working on setting up a database of vehicle records based on data from a third party vendor. There are 3 models: Vehicles, Transmissions, and Adaptors.
Both Vehicles and Transmissions are 1:n Adaptors. Technically Transmissions are just Vehicles with additional transmission columns, while Vehicles contain fuel system columns in addition to the vehicle columns. I am having trouble determining the best way of combining my data into one normalized set.
Here are some samples of my data:
Transmissions table
Vehicles Table
Pivot Tables
As you can see, Transmissions and Vehicles are basically the same up to the
Ideally, I am hoping to end up with just one table of vehicle data. For instance, if I were to merge the highlighted records, I should end up with only 2 records for a
One with a 6.0L V8 engine and the other with a 6.6L V8 engine, along with the respective fuel system and transmission columns for each.
The pivot tables would also be combined as well which means I need to replace the old IDs with new IDs from the combined data, somehow.
Here are my thoughts on how to approach this:
I apologize for the lengthy post, but I have yet to find anything in my searches that really outlines what this process may look like. Any advice
Both Vehicles and Transmissions are 1:n Adaptors. Technically Transmissions are just Vehicles with additional transmission columns, while Vehicles contain fuel system columns in addition to the vehicle columns. I am having trouble determining the best way of combining my data into one normalized set.
Here are some samples of my data:
Transmissions table
Vehicles Table
Pivot Tables
As you can see, Transmissions and Vehicles are basically the same up to the
trans_type/fuel_system columns (respectively) and beyond. For example, the Vehicle I have highlighted is the same as the 2nd Transmission shown with engine_code LML. Ideally, I am hoping to end up with just one table of vehicle data. For instance, if I were to merge the highlighted records, I should end up with only 2 records for a
2013 GMC Sierra 3500 HD:One with a 6.0L V8 engine and the other with a 6.6L V8 engine, along with the respective fuel system and transmission columns for each.
The pivot tables would also be combined as well which means I need to replace the old IDs with new IDs from the combined data, somehow.
Here are my thoughts on how to approach this:
- Normalize the similar columns (makes, models, engine, transmission)
- Simply split off the
fuel_systemandtrans_typecolumns into their own tables (but this does nothing to help me combine, it just makes things a bit more manageable until I can figure out how to combine)
- Make a new table containing all the columns from both tables, and insert the data from Vehicles and Transmissions, fill in the blanks (i.e. update fuel_system where records match and vice versa for transmission columns), then begin to clean up duplicates.
I apologize for the lengthy post, but I have yet to find anything in my searches that really outlines what this process may look like. Any advice
Solution
Functional dependencies and normalization
In order to carry out a normalization exercise involving second and further normal forms —as per the relational model of data by Dr. E. F Codd—, one first has to know what are the relevant functional dependencies (FDs for brevity) between the attributes (usually depicted as columns) of an adapted mathematical relation (usually portrayed as a table). This kind of exercise belongs to the logical-level of abstraction of a database. That is why, in order to address your stated intention of obtaining a normalized set, I requested information about the applicable FDs via comments.
For instance, a FD involving the hypothetical attributes Foo and Bar can be depicted as Foo → Bar, which may in turn be read as “attribute Foo determines attribute Bar” or “attribute Bar is determined by attribute Foo”. In this way one can distinguish (a) one or more attributes, or one or more combinations of attributes, that are the key, or the keys, of a relation and (b) distinguish the attributes that are not, or are not part of, the key or keys.
Regarding the scenario you describe, let us suppose that the grid entitled Transmissions is a concrete representation of a mathematical relation. With respect to the values of the columns labeled as submodel, body_code, engine_code, engine, desplacement, cylinders, trans_type, trans_code and qualifier, one can say that:
-
Some (engine_code and trans_code) appear to be determined by the values of what would be the key of a relation called, let us say, Vehicle — i.e., the combination of (model, make, year).
-
Some (transmission_type and qualifier) appear to be determined by the values of what would be the key of a relation called, let us say, TransmissionAdaptor — i.e., trans_code.
-
Some (engine_type, displacement and cylinders) appear to be determined by the values of what would be the key of a relation called, let us say, Engine — i.e., engine_code.
But those are simple assumptions, based on my personal interpretation of the information contained in the aforementioned grid, and evidently I am not familiar at all with the business context in question. Therefore, in order to get rid of those kinds of needless and problematic assumptions you have to interview the business experts, and they will help you to identify the FDs, which would in turn guide you in performing proper normalization and laying out the database structure with the precision that the database administration profession requires. If there are no business experts who you can resort to, then you will have to dive into the data sets, observe data usage and meaning, and analyze carefully the interconnections between the pieces of information of interest to determine the significant FDs on your own.
A normalized set
Having a normalized database helps to avoid update/modification anomalies (affecting INSERT, UPDATE and DELETE operations) that arise eventually when there exist undesirable dependencies among the attributes (columns) of the relations (tables) under consideration. In cases in which, e.g., there are attributes of a relation that (i) depend on non-key attributes, or that (ii) depend on parts of a composite —i.e., multi-attribute— key, the designer has to decompose the concerning relation into two or more.
Thus, most of the time, a normalized set is made up of various relations, each of which is meant to contain exactly one particular type of fact in its tuples (rows). Contrarily, unnormalized and non-fully-normalized sets consist of one or more relations that contain more than one type of fact in their tuples.
The design of a database starting from the conceptual level of abstraction
On the other hand, you can design the relevant database starting from a different perspective, analyzing first the structure and associations —or relationshipsa or connections— of the types of things of interest from a purely conceptual point of view, without thinking about relations (tables), attributes (columns), constraints and normalization yet. Of course, this approach also entails close communication between the database designer(s) and the business experts or, in absence of business experts, database designer(s) with in-depth knowledge of the informational needs and characteristics of the business environment.
Sample business rules
So I will put together some hypothetical business rules that will help in the creation of an expository conceptual schema based on mere assumptions about the information contained in your question.
A Vehicle is:
An Engine:
A Transmiss
In order to carry out a normalization exercise involving second and further normal forms —as per the relational model of data by Dr. E. F Codd—, one first has to know what are the relevant functional dependencies (FDs for brevity) between the attributes (usually depicted as columns) of an adapted mathematical relation (usually portrayed as a table). This kind of exercise belongs to the logical-level of abstraction of a database. That is why, in order to address your stated intention of obtaining a normalized set, I requested information about the applicable FDs via comments.
For instance, a FD involving the hypothetical attributes Foo and Bar can be depicted as Foo → Bar, which may in turn be read as “attribute Foo determines attribute Bar” or “attribute Bar is determined by attribute Foo”. In this way one can distinguish (a) one or more attributes, or one or more combinations of attributes, that are the key, or the keys, of a relation and (b) distinguish the attributes that are not, or are not part of, the key or keys.
Regarding the scenario you describe, let us suppose that the grid entitled Transmissions is a concrete representation of a mathematical relation. With respect to the values of the columns labeled as submodel, body_code, engine_code, engine, desplacement, cylinders, trans_type, trans_code and qualifier, one can say that:
-
Some (engine_code and trans_code) appear to be determined by the values of what would be the key of a relation called, let us say, Vehicle — i.e., the combination of (model, make, year).
-
Some (transmission_type and qualifier) appear to be determined by the values of what would be the key of a relation called, let us say, TransmissionAdaptor — i.e., trans_code.
-
Some (engine_type, displacement and cylinders) appear to be determined by the values of what would be the key of a relation called, let us say, Engine — i.e., engine_code.
But those are simple assumptions, based on my personal interpretation of the information contained in the aforementioned grid, and evidently I am not familiar at all with the business context in question. Therefore, in order to get rid of those kinds of needless and problematic assumptions you have to interview the business experts, and they will help you to identify the FDs, which would in turn guide you in performing proper normalization and laying out the database structure with the precision that the database administration profession requires. If there are no business experts who you can resort to, then you will have to dive into the data sets, observe data usage and meaning, and analyze carefully the interconnections between the pieces of information of interest to determine the significant FDs on your own.
A normalized set
Having a normalized database helps to avoid update/modification anomalies (affecting INSERT, UPDATE and DELETE operations) that arise eventually when there exist undesirable dependencies among the attributes (columns) of the relations (tables) under consideration. In cases in which, e.g., there are attributes of a relation that (i) depend on non-key attributes, or that (ii) depend on parts of a composite —i.e., multi-attribute— key, the designer has to decompose the concerning relation into two or more.
Thus, most of the time, a normalized set is made up of various relations, each of which is meant to contain exactly one particular type of fact in its tuples (rows). Contrarily, unnormalized and non-fully-normalized sets consist of one or more relations that contain more than one type of fact in their tuples.
The design of a database starting from the conceptual level of abstraction
On the other hand, you can design the relevant database starting from a different perspective, analyzing first the structure and associations —or relationshipsa or connections— of the types of things of interest from a purely conceptual point of view, without thinking about relations (tables), attributes (columns), constraints and normalization yet. Of course, this approach also entails close communication between the database designer(s) and the business experts or, in absence of business experts, database designer(s) with in-depth knowledge of the informational needs and characteristics of the business environment.
Sample business rules
So I will put together some hypothetical business rules that will help in the creation of an expository conceptual schema based on mere assumptions about the information contained in your question.
A Vehicle is:
- primarily identified by exactly one combination of Model, Make and Year
- moved by exactly one Engine
- equipped with exactly one TransmissionAdaptor
- …
An Engine:
- is primarily identified by exactly one EngineCode
- is catalogued by exactly one EngineType
- has exactly one Displacement
- holds exactly one type of Cylinders
- is provided with exactly one FuelSystem
- is installed on zero, one or more Vehicles
- …
A Transmiss
Code Snippets
CREATE TABLE TransmissionAdaptor (
TransmissionAdaptorCode CHAR(10) NOT NULL,
TransmissionType CHAR(30) NOT NULL,
--
CONSTRAINT TransmissionAdaptor_PK PRIMARY KEY (TransmissionAdaptorCode)
);
CREATE TABLE FuelSystem (
Name CHAR(6) NOT NULL,
FuelSystemType CHAR(10) NOT NULL,
Pressure CHAR(10) NOT NULL,
FuelSystemName CHAR(30) NOT NULL,
--
CONSTRAINT FuelSystem_PK PRIMARY KEY (Name)
);
CREATE TABLE MyEngine (
EngineCode CHAR(3) NOT NULL,
EngineType CHAR(8) NOT NULL,
Displacement CHAR(3) NOT NULL,
Cylinders CHAR(3) NOT NULL,
FuelSystemName CHAR(6) NOT NULL,
--
CONSTRAINT Engine_PK PRIMARY KEY (EngineCode),
CONSTRAINT Engine_to_FuelSystem_FK FOREIGN KEY (FuelSystemName)
REFERENCES FuelSystem (Name)
);
CREATE TABLE Vehicle (
Model CHAR(20) NOT NULL,
Make CHAR(10) NOT NULL,
MyYear SMALLINT NOT NULL,
EngineCode CHAR(3) NOT NULL,
TransmissionAdaptorCode CHAR(10) NOT NULL,
--
CONSTRAINT Vehicle_PK PRIMARY KEY (Model, Make, MyYear),
CONSTRAINT Vehicle_to_Engine_FK FOREIGN KEY (EngineCode)
REFERENCES MyEngine (EngineCode),
CONSTRAINT Vehicle_to_Transmission_FK FOREIGN KEY (TransmissionAdaptorCode)
REFERENCES TransmissionAdaptor (TransmissionAdaptorCode)
);Context
StackExchange Database Administrators Q#220893, answer score: 7
Revisions (0)
No revisions yet.