HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Scalable table structure for periodically updated stats that get aggregated over time

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
scalablestatstimegetperiodicallystructureforupdatedthatover

Problem

I receive statistical data every 30 seconds that I want to store in my database so that I can analyze later. For example, every 30 seconds I could receive the number of oranges sold at a store in the last 30 seconds. Later, I want to retrieve this data from the database and use it to generate charts showing information like the number of oranges sold for a store over the last 24 hours, last x weeks, last x months, and last x years.

If I just dump everything into one table, it seems like it would grow very quickly, especially if you have lots of data sources (stores). My thought was that the data could be averaged so that it was less granular over time. That is, keep detailed records over the last couple of hours (entries in the DB for every 30 seconds), then perhaps averages of 15 minute time spans for the last few weeks, then keep averages of each day for the last few months, etc.

This way you have a large number of recent records, a good number of relatively older records, and a few old records. However all the data is still there, it's just summed up and averaged into one entry over days or months instead of 30 seconds.

Does this approach make sense? Is there a better approach? How would I organize this into a table? Would it be multiple tables? Is SQL (probably MySQL) a good fit or would something work better? Any thoughts on this would be greatly appreciated!

Solution

I'm frankly always nervous about throwing away details. For that reason I'd try to find a way to keep handy (or at least archive) the most granular data. That way, if your summarization requirements change, you can always go back to the source data and recalculate appropriate summary statistics.

As to your approach of calculating averages over varying times at various ages, this is a fair practice that would meet the needs of many organizations, to a point.

While it may certainly be true that you don't care about per 30 second periods after a few weeks, let alone one year from now, at what point do you stop increasing the reporting period? Do you stop at days or do you go to weeks? (or months? or quarters?)

Once you get past days, you run into issues with doing year over year comparisons, which many organizations rely upon. I have seen some datawarehouses where there are multiple periods precalculated. Whether you do this or just pick one "longest period" e.g. days, is a matter of how you trade off quick access and redundant data.

Regarding "How would I organize this into a table?" the best way would be to have starting and ending date/times (down to the second) for the period being represented, along with the average count over that period. For convenience, you might also include a partitioning attribute that describes the length of the period, e.g. 'm', 'h', 'd', 'w', 'M', 'q', 'y' or whatever makes sense for your summaries.

Context

StackExchange Database Administrators Q#7036, answer score: 6

Revisions (0)

No revisions yet.