Numerical data with lots of NULLs: 6NF or document database?
Problem
I have a source of numerical data, mostly comprised of sensor readings, physical quantities and flags, each associated with a datetime.

The data has the following characteristics:
- There are hundreds of fields, and new unexpected fields may pop up at any time. We need to record those too.
- For every row, only a small subset of the fields is available.
- Sometimes a field is available but corrupted, i.e. NaN, and we need to record that.
- Data is written once, updated almost never and read relatively often.
- We need to do range queries on the data, both on the date and on the actual numbers.
- We don't want to lose any data.
- It doesn't need to be a distributed system.

I'm wondering what the best way to store these data would be.

Solution 1: One big table
A single huge table would be a bad idea, because it would be full of NULLs and inconvenient:

| id  | timetag                 | sensor1   | sensor2        | flag1 | ... |
-----------------------------------------------------------------------------
| 943 | 2016-12-08 10:27:00.000 | 1.2323523 | -23123.5346234 | 0     | ... |
| 944 | 2016-12-08 10:27:00.038 | NULL      | 163.6151345    | 1     | ... |
| 945 | 2016-12-08 10:27:02.000 | 1.2477801 | NULL           | 3     | ... |
| ... |

Moreover, it would be impossible to distinguish between "recorded but NaN" NULLs and "not recorded" NULLs.

Solution 2: 6th normal form
The first solution that comes to mind is to use 6NF, creating one table for each field (sketched after this list).
This is clean in theory, but:
- queries would require many, many JOINs at the same time;
- new tables would have to be created on the fly when a new field arrives for the first time;
- fields that are always paired, for example longitude and latitude, would have to be automatically detected and their tables merged for efficiency;
- if at some point one of the columns in a "merged" table (e.g. lon in the lon-lat example above) starts arriving on its own, the merged table would have to be split again.
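To make the trade-off concrete, here is a minimal sketch of what the one-table-per-field layout could look like; the table and column names are only illustrative, not taken from the question:

-- One narrow table per field, keyed by the reading's timetag.
CREATE TABLE sensor1
( timetag timestamp primary key,
  value   float           -- a NULL value could stand for "recorded but NaN"
) ;

CREATE TABLE sensor2
( timetag timestamp primary key,
  value   float
) ;

-- Reassembling a "wide" row already costs one join per field:
SELECT timetag,
       s1.value AS sensor1,
       s2.value AS sensor2
FROM sensor1 AS s1
FULL JOIN sensor2 AS s2 USING (timetag) ;

With hundreds of fields, a query like the one above would need hundreds of joins, which is the first objection listed.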
Solution
JSONB might be a good option. If the TYPES of your flags and sensors are always the same (let's say they are all floats and ints), you could also use another strategy:
CREATE TABLE tb
( id           serial primary key,       -- surrogate key for the observation
  timetag      timestamp default now(),
  sensor_nr    integer,                  -- which sensor this row refers to, if any
  sensor_value float,                    -- NULL here means "recorded but NaN"
  flag_nr      integer,                  -- which flag this row refers to, if any
  flag_value   integer
) ;

If you have real values for sensors 2 and 1000, you would then do:
INSERT INTO
    tb
    (sensor_nr, sensor_value)
VALUES
    (2, 123.456),
    (1000, 0.123) ;

Or, if there are flags:
INSERT INTO
    tb
    (sensor_nr, sensor_value, flag_nr, flag_value)
VALUES
    (1000, 123.456, 1000, 0),
    (2, 234.567, 2, 1) ;

The NaN value would be represented by NULL, and "no reading" would be represented simply by the non-existence of a row for that timetag/sensor_nr.
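As an illustration of that convention (the values are made up, not from the original answer): a corrupted reading keeps its row but stores NULL, while a sensor that was not read at all simply gets no row:

-- Sensor 2 was read but returned NaN: keep the row, store NULL as the value.
INSERT INTO
    tb
    (sensor_nr, sensor_value)
VALUES
    (2, NULL) ;

-- Sensor 1000 was not read at this time: nothing is inserted for it.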
If you need frequent range SELECTs of different kinds, you would index by timetag, and also by (sensor_nr, sensor_value). The insert costs would be relatively high, but the SELECTs could be fast.
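A sketch of what those indexes and a typical range query could look like; the index names and the example predicate are illustrative, not part of the original answer:

CREATE INDEX tb_timetag_idx ON tb (timetag) ;
CREATE INDEX tb_sensor_idx  ON tb (sensor_nr, sensor_value) ;

-- Range query on both the date and the value:
-- e.g. readings of sensor 2 above 100 during a given hour.
SELECT timetag, sensor_value
FROM tb
WHERE sensor_nr = 2
  AND sensor_value > 100.0
  AND timetag BETWEEN '2016-12-08 10:00' AND '2016-12-08 11:00' ;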
If the "new unexpected fields that pop at any type" are of different types (i.e.: you have some co-ordinate pairs (float,float) and not just simple floats) this approach won't be flexible enough. In that occasion, probably JSON(B) and the new indices is probably your best alternative; at the cost of losing (some) type safety.
Code Snippets
CREATE TABLE tb
(id serial primary key,
timetag timestamp default now(),
sensor_nr integer,
sensor_value float,
flag_nr integer,
flag_value integer
) ;

INSERT INTO
    tb
    (sensor_nr, sensor_value)
VALUES
    (2, 123.456),
    (1000, 0.123) ;

INSERT INTO
    tb
    (sensor_nr, sensor_value, flag_nr, flag_value)
VALUES
    (1000, 123.456, 1000, 0),
    (2, 234.567, 2, 1) ;

Context
StackExchange Database Administrators Q#157538, answer score: 3