patternsqlCritical
Can MySQL reasonably perform queries on billions of rows?
Viewed 0 times
rowscanreasonablymysqlperformbillionsqueries
Problem
I am planning on storing scans from a mass spectrometer in a MySQL database and
would like to know whether storing and analyzing this amount of data is remotely
feasible. I know performance varies wildly depending on the environment, but I'm
looking for the rough order of magnitude: will queries take 5 days or 5
milliseconds?
Input format
Each input file contains a single run of the spectrometer; each run is comprised
of a set of scans, and each scan has an ordered array of datapoints. There is a
bit of metadata, but the majority of the file is comprised of arrays 32- or
64-bit ints or floats.
Host system
|----------------+-------------------------------|
| OS | Windows 2008 64-bit |
| MySQL version | 5.5.24 (x86_64) |
| CPU | 2x Xeon E5420 (8 cores total) |
| RAM | 8GB |
| SSD filesystem | 500 GiB |
| HDD RAID | 12 TiB |
|----------------+-------------------------------|
There are some other services running on the server using negligible processor
time.
File statistics
|------------------+--------------|
| number of files | ~16,000 |
| total size | 1.3 TiB |
| min size | 0 bytes |
| max size | 12 GiB |
| mean | 800 MiB |
| median | 500 MiB |
| total datapoints | ~200 billion |
|------------------+--------------|
The total number of datapoints is a very rough estimate.
Proposed schema
I'm planning on doing things "right" (i.e. normalizing the data like crazy) and
so would have a
and a
The 200 Billion datapoint question
I am going to be analyzing across multiple spectra and possibly even multiple
runs, resulting in queries which could touch millions of rows. Assuming I index
everything properly (which is a topic for another question)
would like to know whether storing and analyzing this amount of data is remotely
feasible. I know performance varies wildly depending on the environment, but I'm
looking for the rough order of magnitude: will queries take 5 days or 5
milliseconds?
Input format
Each input file contains a single run of the spectrometer; each run is comprised
of a set of scans, and each scan has an ordered array of datapoints. There is a
bit of metadata, but the majority of the file is comprised of arrays 32- or
64-bit ints or floats.
Host system
|----------------+-------------------------------|
| OS | Windows 2008 64-bit |
| MySQL version | 5.5.24 (x86_64) |
| CPU | 2x Xeon E5420 (8 cores total) |
| RAM | 8GB |
| SSD filesystem | 500 GiB |
| HDD RAID | 12 TiB |
|----------------+-------------------------------|
There are some other services running on the server using negligible processor
time.
File statistics
|------------------+--------------|
| number of files | ~16,000 |
| total size | 1.3 TiB |
| min size | 0 bytes |
| max size | 12 GiB |
| mean | 800 MiB |
| median | 500 MiB |
| total datapoints | ~200 billion |
|------------------+--------------|
The total number of datapoints is a very rough estimate.
Proposed schema
I'm planning on doing things "right" (i.e. normalizing the data like crazy) and
so would have a
runs table, a spectra table with a foreign key to runs,and a
datapoints table with a foreign key to spectra.The 200 Billion datapoint question
I am going to be analyzing across multiple spectra and possibly even multiple
runs, resulting in queries which could touch millions of rows. Assuming I index
everything properly (which is a topic for another question)
Solution
I am not very familiar with your needs, but perhaps storing each data point in the database is a bit of overkill. It sound almost like taking the approach of storing an image library by storing each pixel as a separate record in a relational database.
As a general rule, storing binary data in databases is wrong most of the time. There is usually a better way of solving the problem. While it is not inherently wrong to store binary data in relational database, often times the disadvantages outweigh the gains. Relational databases, as the name alludes to, are best suited for storing relational data. Binary data is not relational. It adds size (often significantly) to databases, can hurt performance, and may lead to questions about maintaining billion-record MySQL instances. The good news is that there are databases especially well suited for storing binary data. One of them, while not always readily apparent, is your file system! Simply come up with a directory and file naming structure for your binary files, store those in your MySQL DB together with any other data which may yield value through querying.
Another approach would be using a document-based storage system for your datapoints (and perhaps spectra) data, and using MySQL for the runs (or perhaps putting the runs into the same DB as the others).
As a general rule, storing binary data in databases is wrong most of the time. There is usually a better way of solving the problem. While it is not inherently wrong to store binary data in relational database, often times the disadvantages outweigh the gains. Relational databases, as the name alludes to, are best suited for storing relational data. Binary data is not relational. It adds size (often significantly) to databases, can hurt performance, and may lead to questions about maintaining billion-record MySQL instances. The good news is that there are databases especially well suited for storing binary data. One of them, while not always readily apparent, is your file system! Simply come up with a directory and file naming structure for your binary files, store those in your MySQL DB together with any other data which may yield value through querying.
Another approach would be using a document-based storage system for your datapoints (and perhaps spectra) data, and using MySQL for the runs (or perhaps putting the runs into the same DB as the others).
Context
StackExchange Database Administrators Q#20335, answer score: 127
Revisions (0)
No revisions yet.