patternsqlMinor
How should I model a database structure to store information about words?
Viewed 0 times
howwordsstoredatabasestructureaboutshouldmodelinformation
Problem
I have an application where the user can click on any word in a text, which will bring back related information about the base word and the specific inflection (word form) of the word clicked. I've tried representing this in a multitude of ways, but now that I'm close to launch, I'm worried as hell about the best way to represent it (this is my first ever project with programming, so I have no idea what I'm in for).
Considerations
The way I had originally set this up was that there was a
I then learned that this was a bad idea in terms of normalization, and I also realized that I needed to store specific
In the second iteration, I had three tables, a
However, I've been re-thinking this setup, because semantically speaking, a base word is also a word, just a specific type of word. So in this case, I feel like a “self-join” table might be most appropriate, although I could be completely overthinking this. In the “self-join” table, there would be an
Note: I found these two similar questions which didn't seem to completely answer my questions, so I'm posting my twist on these here:
-
Table design for a dictionary that can have words with many different spellings
-
How can I represent a list of words and related words in a relational structure?
Considerations
The way I had originally set this up was that there was a
base word table that contained the base word, part of speech, and then an array of word forms for each base word.I then learned that this was a bad idea in terms of normalization, and I also realized that I needed to store specific
inflection information for the words (mood, tense, person, etc.), so I rearranged it.In the second iteration, I had three tables, a
base_word table that had the base_word and the part of speech, a word table that had the word and its inflection information, and then a “join” table between them that contained the base_word_id and the word_id.However, I've been re-thinking this setup, because semantically speaking, a base word is also a word, just a specific type of word. So in this case, I feel like a “self-join” table might be most appropriate, although I could be completely overthinking this. In the “self-join” table, there would be an
is_baseword boolean attribute, which would theoretically allow all the inflection columns to store NULL marks if it was set to true. Is this a better idea?Note: I found these two similar questions which didn't seem to completely answer my questions, so I'm posting my twist on these here:
-
Table design for a dictionary that can have words with many different spellings
-
How can I represent a list of words and related words in a relational structure?
Solution
I believe your design for a one to many join with
Don't get too concerned with perfect normalization. While a perfectly normalized system may adhere to strict information science principles, it is often better in the real world to have a slightly denormalized database for other reasons.
base_word in one table and word forms in another is fine. The reasons I make this judgement are:- The
base_word, as an entity, is sufficiently different toword_formsas to represent a different entity. For instance, if you were to include the words origin language, this would also be only applicable to the base word entity, not its word forms.
- Storing the data in the same table will have implications on the size of the tables due to having sparse columns with attributes that are associated with
base_wordORword_formonly, meaning they are NULL or empty for the others. You point this out with theis_basewordandinflectioncolumns. Having a bunch of NULL values on a row which will never be changed is usually a reason to use the EAV model or split into different entities.
- Performance considerations for queries should be taken into account. With the additional size consideration a self join to a sparsely populated table will likely not be as performant. You will probably find yourself indexing a single table for
base_wordsORword_formsseparately. The indexes will be larger and more difficult to maintain for inserts/deletes etc. Better to have indexes on smaller tables and to have as few as possible on a table.
Don't get too concerned with perfect normalization. While a perfectly normalized system may adhere to strict information science principles, it is often better in the real world to have a slightly denormalized database for other reasons.
Context
StackExchange Database Administrators Q#162364, answer score: 2
Revisions (0)
No revisions yet.