snippetMinor
How to handle queries of 500M+ items
Viewed 0 times
500mhandleitemshowqueries
Problem
The structure of my data is the following:
I need to perform the following queries:
First:
Second, with the filtered data:
The number of the system's users is about 10 or 15, but the number of items is huge, right now it is 70M but it will be 500M in a couple of weeks and it will be 1000M in about a year.
The number of queries is small, no more than 10 users concurrently, my problem is how to handle those queries with this huge amount of data.
What have I tried so far?
-
I started with
-
As
-
Finally, since data format isn't going to change and it is only one collection/table so needs no joins in SQL, I checked postgresql. My tests has been faster with postgresql, but I'm scared it could not scale properly in the future.
What do I need?
E
date:
filter_a: -> range [0, 1000]
filter_b: -> range [0, 1000]
filter_c: -> range [0, 86400]
filter_d: -> range [0, 6]
group:
second_group:
variable_a:
variable_b:
variable_c:
a couple more no very importantI need to perform the following queries:
First:
- Filter data by
date,filter_a,filter_b,filter_cand others
Second, with the filtered data:
- count all records
- get average of
variable_a,variable_bandvariable_c
- get standard deviation of
variable_a,variable_bandvariable_c
- get quartiles of
variable_a,variable_bandvariable_c
- group data by
grouporsecond_groupand aggregate(Count, Avg, Std, ..)
The number of the system's users is about 10 or 15, but the number of items is huge, right now it is 70M but it will be 500M in a couple of weeks and it will be 1000M in about a year.
The number of queries is small, no more than 10 users concurrently, my problem is how to handle those queries with this huge amount of data.
What have I tried so far?
-
I started with
mongodb, at the beginning it was fast but it became slow when calculating quartiles with 10M+. It improved when I added indexes but it didn't help very much when I had to query all data. I started using mongodb because data was very dynamic but luckily the data format "isn't going to change anymore".-
As
filter_a and filter_b could be seen like nodes, I tried neo4j. I liked it neo4j very much but my graph had A LOT of edges so that queries wasn't very fast.-
Finally, since data format isn't going to change and it is only one collection/table so needs no joins in SQL, I checked postgresql. My tests has been faster with postgresql, but I'm scared it could not scale properly in the future.
What do I need?
- Is postgresql a good choice for this case?
- Is there another kind of database I could use? which one is the best for this case?
- What else could I do to improve it?
E
Solution
Instead of leaning on a relational database to perform these statistical calculations on time-series data, I'd suggest that you move this math and post-processing work outside of the database into a client application.
Using a scripting language like Python or Ruby, you can incremental solve the problem by querying for "chunks" of data over a fixed-width period of time, compute an intermediate statistical summary, and then combine the results across multiple chunks, as you loop over the whole history. Some statistical measures are hard to combine across chunks, but something like Avg() only needs sum() and count() per chunk, O(1) vs. O(chunksize), so chunk-merging may scale well.
Using a scripting language like Python or Ruby, you can incremental solve the problem by querying for "chunks" of data over a fixed-width period of time, compute an intermediate statistical summary, and then combine the results across multiple chunks, as you loop over the whole history. Some statistical measures are hard to combine across chunks, but something like Avg() only needs sum() and count() per chunk, O(1) vs. O(chunksize), so chunk-merging may scale well.
Context
StackExchange Database Administrators Q#147676, answer score: 6
Revisions (0)
No revisions yet.