HiveBrain v1.2.0
Get Started
← Back to all entries
snippetMinor

How to handle queries of 500M+ items

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
500mhandleitemshowqueries

Problem

The structure of my data is the following:

date: 
filter_a:  -> range [0, 1000]
filter_b:  -> range [0, 1000]
filter_c:  -> range [0, 86400]
filter_d:  -> range [0, 6]
group: 
second_group: 
variable_a: 
variable_b: 
variable_c: 
a couple more no very important


I need to perform the following queries:

First:

  • Filter data by date, filter_a, filter_b, filter_c and others



Second, with the filtered data:

  • count all records



  • get average of variable_a, variable_b and variable_c



  • get standard deviation of variable_a, variable_b and variable_c



  • get quartiles of variable_a, variable_b and variable_c



  • group data by group or second_group and aggregate(Count, Avg, Std, ..)



The number of the system's users is about 10 or 15, but the number of items is huge, right now it is 70M but it will be 500M in a couple of weeks and it will be 1000M in about a year.

The number of queries is small, no more than 10 users concurrently, my problem is how to handle those queries with this huge amount of data.

What have I tried so far?

-
I started with mongodb, at the beginning it was fast but it became slow when calculating quartiles with 10M+. It improved when I added indexes but it didn't help very much when I had to query all data. I started using mongodb because data was very dynamic but luckily the data format "isn't going to change anymore".

-
As filter_a and filter_b could be seen like nodes, I tried neo4j. I liked it neo4j very much but my graph had A LOT of edges so that queries wasn't very fast.

-
Finally, since data format isn't going to change and it is only one collection/table so needs no joins in SQL, I checked postgresql. My tests has been faster with postgresql, but I'm scared it could not scale properly in the future.

What do I need?

  • Is postgresql a good choice for this case?



  • Is there another kind of database I could use? which one is the best for this case?



  • What else could I do to improve it?



E

Solution

Instead of leaning on a relational database to perform these statistical calculations on time-series data, I'd suggest that you move this math and post-processing work outside of the database into a client application.

Using a scripting language like Python or Ruby, you can incremental solve the problem by querying for "chunks" of data over a fixed-width period of time, compute an intermediate statistical summary, and then combine the results across multiple chunks, as you loop over the whole history. Some statistical measures are hard to combine across chunks, but something like Avg() only needs sum() and count() per chunk, O(1) vs. O(chunksize), so chunk-merging may scale well.

Context

StackExchange Database Administrators Q#147676, answer score: 6

Revisions (0)

No revisions yet.