patternsqlMinor
Db design for historical option data (~200GB)
Viewed 0 times
design200gboptionhistoricalfordata
Problem
I need some help determining how to best design my tables. I have about 200GB worth of data I wish to place in a table (historical option price data). The table looks something like:
95% of the time I will be fetching the current date's data. But the other 5% of the time I want to analyze historical data.
Most of the queries would involve some combination of symbol/date/exp.
I'm not sure how to best organize the table/s in terms of efficiency. Would it be best to put the data into two tables: historical and current? Or an index on the date? Or maybe an extra boolean column (isCurrent) which indicates which data is current or not?
If it matters, I'm using java/hibernate/mysql under the covers.
Thanks
String symbol
Date date
Date exp;
double strike;
double ask;
double bid;
95% of the time I will be fetching the current date's data. But the other 5% of the time I want to analyze historical data.
Most of the queries would involve some combination of symbol/date/exp.
I'm not sure how to best organize the table/s in terms of efficiency. Would it be best to put the data into two tables: historical and current? Or an index on the date? Or maybe an extra boolean column (isCurrent) which indicates which data is current or not?
If it matters, I'm using java/hibernate/mysql under the covers.
Thanks
Solution
The best first strategy is always to keep your transactional data all together in one, normalized data store. This minimizes the risk of data integrity problems and keeps the amount of code you have to write to a minimum.
Often, people will find that there is some kind of concern around performance. This leads them to consider denormalization. The best thing is to wait until your volume testing proves that there is a real problem with performance, rather than assuming that there will be one because your table is "big".
Given that you have a large amount of historical data that you don't use most of the time, it is possible that you may benefit from denormalization. You really should test this out before assuming it to be the case though. If it is the case, then one approach which might work well for you is to keep two data stores: (a) one where your current values are kept and (b) one where all values, including current and historical are kept.
If this makes sense to do (for performance reasons), you should build some automation to track all changes to the current data and record them in your history table. This is best done with triggers in DBMS that support them.
What almost never makes sense is keeping current values in one place and historical values (excluding current values) in another place. I have never come across a situation where you need to query only historical records, excluding current values. This means that your queries would have to be twice as complex in order to use current + historical values if you segregate them in this way.
This is what I would recommend to you:
Often, people will find that there is some kind of concern around performance. This leads them to consider denormalization. The best thing is to wait until your volume testing proves that there is a real problem with performance, rather than assuming that there will be one because your table is "big".
Given that you have a large amount of historical data that you don't use most of the time, it is possible that you may benefit from denormalization. You really should test this out before assuming it to be the case though. If it is the case, then one approach which might work well for you is to keep two data stores: (a) one where your current values are kept and (b) one where all values, including current and historical are kept.
If this makes sense to do (for performance reasons), you should build some automation to track all changes to the current data and record them in your history table. This is best done with triggers in DBMS that support them.
What almost never makes sense is keeping current values in one place and historical values (excluding current values) in another place. I have never come across a situation where you need to query only historical records, excluding current values. This means that your queries would have to be twice as complex in order to use current + historical values if you segregate them in this way.
This is what I would recommend to you:
- Keep you data in one place if you can (definitely index on date)
- Denormalize if you have to, but keep current values in your history table.
- Do not use a bit flag for "is_current". This almost never indexes well.
Context
StackExchange Database Administrators Q#39919, answer score: 3
Revisions (0)
No revisions yet.