patternCritical
Is there a name for this database schema of key values?
Viewed 0 times
thisdatabasenameforvaluesthereschemakey
Problem
We process a routine data feed from a client who just refactored their database from a form that seems familiar (one row per entity, one column per attribute) to one that seems unfamiliar to me (one row per entity per attribute):
Before: one column per attribute
After: one column for all attributes
Is there a name for this database structure? What are the relative advantages? The old way seems easier to place validity constraints on specific attributes (non-null, non-negative, etc.) and easier to calculate averages. But I can see how it might be easier to add new attributes without refactoring the database. Is this a standard/preferred way of structuring data?
Before: one column per attribute
ID Ht_cm wt_kg Age_yr ...
1 190 82 43 ...
2 170 60 22 ...
3 205 90 51 ...After: one column for all attributes
ID Metric Value
1 Ht_cm 190
1 Wt_kg 82
1 Age_yr 43
1 ...
2 Ht_cm 170
2 Wt_kg 60
2 Age_yr 22
2 ...
3 Ht_cm 205
3 Wt_kg 90
3 Age_yr 51
3 ...Is there a name for this database structure? What are the relative advantages? The old way seems easier to place validity constraints on specific attributes (non-null, non-negative, etc.) and easier to calculate averages. But I can see how it might be easier to add new attributes without refactoring the database. Is this a standard/preferred way of structuring data?
Solution
It's called Entity-Attribute-Value (also sometimes 'name-value pairs') and it's a classic case of "a round peg in a square hole" when people use the EAV pattern in a relational database.
Here's a list of why you shouldn't use EAV:
Compare:
To:
Here's a (very short) list of when you should use EAV:
I know I just spent this entire post detailing why EAV is a terrible idea in most cases - but there are a few cases where it's needed/unavoidable. however, most of the time (including the example above), it's going to be far more hassle than it's worth. If you have a requirement for wide support of EAV-type data input, you should look at storing them in a key-value system, e.g. Hadoop/HBase, CouchDB, MongoDB, Cassandra, BerkeleyDB.
Here's a list of why you shouldn't use EAV:
- You can't use data types. It doesn't matter if the value is a date, a number or money (decimal). It's always going to be coerced to varchar. This can be anything from a minor performance problem to a massive gut-ache (ever had to chase down a one-cent variation in a monthly roll-up report?).
- You can't (easily) enforce constraints. It requires a ridiculous amount of code to enforce "Everyone needs to have a height between 0 and 3 metres" or "Age must be not null and >= 0", as opposed to the 1-2 lines that each of those constraints would be in a properly-modelled system.
- Related to above, you can't easily guarantee that you get the information you need for each client (age might be missing from one, then the next might be missing their height etc.). You can do it, but it's a hell of a lot more difficult than
SELECT height, weight, age FROM Client where height is null or weight is null.
- Related again, duplicate data is a lot harder to detect (what happens if they give you two ages for one client? De-EAVing the data, as below, will give you two rows of results if you have one attribute doubled. If one client has two separate entries for two attributes, you'll get four rows from the query below).
- You can't even guarantee that the attribute names are consistent. "Age_yr" might become "AGE_IN_YEARS" or "age". (Admittedly this is less of a problem when you're receiving an extract versus when people are inserting data, but still.)
- Any sort of nontrivial query is a complete disaster. To relationalise a three-attribute EAV system so you can query it in a rational fashion requires three joins of the EAV table.
Compare:
SELECT cID.ID AS [ID], cH.Value AS [Height], cW.Value AS [Weight], cA.Value AS [Age]
FROM (SELECT DISTINCT ID FROM Client) cID
LEFT OUTER JOIN
Client cW ON cID.ID = cW.ID AND cW.Metric = "Wt_kg"
LEFT OUTER JOIN
Client cH ON cID.ID = cH.ID AND cW.Metric = "Ht_cm"
LEFT OUTER JOIN
Client cA ON cID.ID = cA.ID AND cW.Metric = "Age_yr"To:
SELECT c.ID, c.Ht_cm, c.Wt_kg, c.Age_yr
FROM Client cHere's a (very short) list of when you should use EAV:
- When there's absolutely no way around it and you have to support schema-less data in your database.
- When you just need to store "stuff" and don't expect to have to need it in a more structured form. Beware, though, the monster called "changing requirements".
I know I just spent this entire post detailing why EAV is a terrible idea in most cases - but there are a few cases where it's needed/unavoidable. however, most of the time (including the example above), it's going to be far more hassle than it's worth. If you have a requirement for wide support of EAV-type data input, you should look at storing them in a key-value system, e.g. Hadoop/HBase, CouchDB, MongoDB, Cassandra, BerkeleyDB.
Code Snippets
SELECT cID.ID AS [ID], cH.Value AS [Height], cW.Value AS [Weight], cA.Value AS [Age]
FROM (SELECT DISTINCT ID FROM Client) cID
LEFT OUTER JOIN
Client cW ON cID.ID = cW.ID AND cW.Metric = "Wt_kg"
LEFT OUTER JOIN
Client cH ON cID.ID = cH.ID AND cW.Metric = "Ht_cm"
LEFT OUTER JOIN
Client cA ON cID.ID = cA.ID AND cW.Metric = "Age_yr"SELECT c.ID, c.Ht_cm, c.Wt_kg, c.Age_yr
FROM Client cContext
StackExchange Database Administrators Q#20759, answer score: 99
Revisions (0)
No revisions yet.