Large (>22 trillion items) geospatial dataset with rapid (<1s) read query performance
Problem
I'm in the process of designing a new system for a large geospatial data set that will require rapid read-query performance, so I want to see whether anyone thinks this is possible, or has experience/advice about suitable DBMSs, data structures, or alternative methods for achieving the required performance in the following situation:
Data will be continuously produced from processed satellite radar data, which will have global coverage. Based on the satellite's resolution and the land coverage of the globe, I estimate the full data set will produce values at 75 billion discrete locations on the globe. Over the life span of a single satellite, it will produce up to 300 values at each of these locations (so a total data set of >22 trillion values). This is for one satellite; there is already a second in orbit, with another two planned in the next few years, so there will be a lot of data! A single data item is very simple and will consist only of (longitude, latitude, value), but due to the number of items I estimate a single satellite will produce up to 100 TB.
The written data should never need updating, as it will only grow as new satellite acquisitions are processed. Write performance is not important, but read performance is crucial. The goal of this project is to be able to visualize the data through a simple interface, such as a layer over Google Maps, where each point has a colored value based on its average, gradient, or some function over time. (demo at end of post).
From these requirements, the database needs to be scalable, and we are likely to look towards cloud solutions. The system needs to handle geospatial queries such as "points near (lat, lon)" and "points within (box)", with read performance of < 1 s for locating a single point and for polygons containing up to 50,000 points (although up to 200,000 points would be preferable).
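For illustration only, the two query shapes would look roughly like this in a PostGIS-style schema (the obs table, geom column, and coordinates below are placeholders, not my actual schema):

    CREATE EXTENSION IF NOT EXISTS postgis;

    -- placeholder table: one row per measurement
    CREATE TABLE obs (
        geom  geometry(Point, 4326) NOT NULL,
        value real                  NOT NULL
    );
    CREATE INDEX obs_geom_idx ON obs USING gist (geom);

    -- "points near (lat, lon)": nearest 1,000 points to a location
    SELECT value
    FROM obs
    ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-1.25, 52.95), 4326)
    LIMIT 1000;

    -- "points within (box)": all points inside a bounding box
    SELECT value
    FROM obs
    WHERE geom && ST_MakeEnvelope(-1.3, 52.9, -1.2, 53.0, 4326);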
So far I have a test data set of ~750 million data items at 111 million locations. I've trialed a postgres/pos
Solution
How up-to-date do your read queries need to be?
You could partition the database by time if the map just needs to show the most recent measurement. This would reduce your query load for the map.
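As a rough sketch of what I mean (all the table and column names here are made up), time-range partitioning in PostgreSQL 10+ could look something like:

    -- raw measurements, range-partitioned by acquisition date, so that a
    -- "latest measurement" map query only touches the newest partition
    CREATE TABLE obs_by_time (
        geom        geometry(Point, 4326) NOT NULL,
        acquired_at date                  NOT NULL,
        value       real                  NOT NULL
    ) PARTITION BY RANGE (acquired_at);

    CREATE TABLE obs_2016 PARTITION OF obs_by_time
        FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
    CREATE INDEX ON obs_2016 USING gist (geom);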
For the history of a given point, you could hold a second store keyed by x and y. This could be refreshed/updated nightly, as the historical data won't change.
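One way to do that nightly rebuild, still assuming the hypothetical obs_by_time table above, would be a materialized view keyed on the point geometry:

    -- per-location history, rebuilt nightly (the historical rows never change)
    CREATE MATERIALIZED VIEW obs_history AS
    SELECT geom,
           array_agg(acquired_at ORDER BY acquired_at) AS acquired_dates,
           array_agg(value       ORDER BY acquired_at) AS value_history
    FROM obs_by_time
    GROUP BY geom;
    CREATE INDEX ON obs_history USING gist (geom);

    -- run from a nightly cron job
    REFRESH MATERIALIZED VIEW obs_history;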
Then you could pre-compute averages at coarser resolutions for integration with maps at different zoom levels. This would reduce the number of points to retrieve for large map areas (zoomed out). Finer resolutions would be used for more zoomed-in maps querying smaller areas. If you really need to speed this up, you could compute tiles as blobs and interpret them in your application.
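For example (again using the made-up obs_by_time table), one coarse zoom level could be pre-aggregated like this, repeating with a different cell size per zoom level:

    -- averages pre-aggregated onto a coarser grid for one zoomed-out level
    -- (0.01-degree cells here); repeat with a larger cell size per zoom level
    CREATE MATERIALIZED VIEW obs_zoom_coarse AS
    SELECT ST_SnapToGrid(geom, 0.01) AS cell,
           avg(value)                AS avg_value,
           count(*)                  AS n_obs
    FROM obs_by_time
    GROUP BY 1;
    CREATE INDEX ON obs_zoom_coarse USING gist (cell);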
Because these would involve some re-computing of aggregate information, there would be some latency in query results. Depending on how much latency is acceptable, you could use this sort of approach to optimise your reads.
OK, so your points need to be computed averages over time. With this computation, I guess the data your queries actually touch comes down quite a lot from 22 trillion items, as the raster values can be pre-calculated for querying.
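Sketching it with the same hypothetical table, you could pre-calculate the per-point averages once and serve the map layer from them:

    -- per-point averages over time, pre-calculated so the map reads roughly
    -- 75 billion aggregate rows instead of the >22 trillion raw values
    CREATE MATERIALIZED VIEW obs_point_avg AS
    SELECT geom, avg(value) AS avg_value
    FROM obs_by_time
    GROUP BY geom;
    CREATE INDEX ON obs_point_avg USING gist (geom);

    -- the map layer then becomes a simple box read over the averages
    SELECT ST_X(geom) AS lon, ST_Y(geom) AS lat, avg_value
    FROM obs_point_avg
    WHERE geom && ST_MakeEnvelope(-1.3, 52.9, -1.2, 53.0, 4326);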
Context
StackExchange Database Administrators Q#143342, answer score: 9
Revisions (0)
No revisions yet.