HiveBrain v1.2.0

Numerical data with lots of NULLs: 6NF or document database?

Submitted by: @import:stackexchange-dba

Problem

I have a source of numerical data, mostly comprised of sensor readings, physical quantities, and flags, each associated with a datetime.

|  id  |        timetag           |   sensor1   |      sensor2     |  flag1  |  ...  |
--------------------------------------------------------------------------------------
| 943  | 2016-12-08 10:27:00.000  |  1.2323523  |  -23123.5346234  |    0    |  ...  |
| 944  | 2016-12-08 10:27:00.038  |     NULL    |     163.6151345  |    1    |  ...  |
| 945  | 2016-12-08 10:27:02.000  |  1.2477801  |       NULL       |    3    |  ...  |
                                       ...


The data has the following characteristics:

  • There are hundreds of fields, and new, unexpected fields may pop up at any time. We need to record those too.

  • For every row, only a small subset of the fields has a value; the rest are NULL.

  • Sometimes a field is available but corrupted, i.e. NaN, and we need to record that.

  • Data is written once, updated almost never, and read relatively often.

  • We need to do range queries on the data, both on the date and on the actual numbers.

  • We don't want to lose any data.

  • It doesn't need to be a distributed system.

I'm wondering what the best way to store these data would be.

Solution 1: One big table

A single huge table would be a bad idea, because it would be full of NULLs and inconvenient to work with. Moreover, it would be impossible to distinguish between "recorded but NaN" NULLs and "not recorded" NULLs.

Solution 2: 6th normal form

The first solution that comes to mind is to use 6NF, creating one table for each field.

This is clean in theory, but:

  • queries would require many JOINs at once;

  • new tables would have to be created on the fly when a new field arrives for the first time;

  • fields that are always paired, such as longitude and latitude, would have to be automatically detected and their tables merged for efficiency;

  • if at some point one of the columns in a "merged" table (e.g. lon in the lon-lat example above) started arriving without its partner, the merged table would have to be split again.
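For concreteness, a 6NF layout along these lines (table and column names are illustrative, not from the original question) would look like the following, with the JOIN cost already visible at just two fields:

```sql
-- One narrow table per field; a row exists only when a reading was recorded.
CREATE TABLE sensor1 (
    timetag timestamp PRIMARY KEY,
    value   float          -- NULL here can mean "recorded but NaN"
);

CREATE TABLE sensor2 (
    timetag timestamp PRIMARY KEY,
    value   float
);

-- Reassembling one wide row requires one JOIN per field;
-- with hundreds of fields this becomes hundreds of JOINs.
SELECT timetag,
       s1.value AS sensor1,
       s2.value AS sensor2
FROM sensor1 s1
FULL OUTER JOIN sensor2 s2 USING (timetag);
```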

Solution

JSONB might be a good option. If the TYPES of your flags and sensors are always the same (let's say they are all floats and ints), you could also use another strategy:

CREATE TABLE tb
    (id           serial primary key,
     timetag      timestamp default now(),
     sensor_nr    integer,    -- which sensor this reading belongs to
     sensor_value float,      -- NULL = recorded but corrupted (NaN)
     flag_nr      integer,    -- which flag, if any
     flag_value   integer
    ) ;


If you have real values for sensors 2 and 1000, you would then do:

INSERT INTO 
    tb 
    (sensor_nr, sensor_value) 
VALUES 
    (2, 123.456), 
    (1000, 0.123) ;


Or, if there are flags:

INSERT INTO 
    tb 
    (sensor_nr, sensor_value, flag_nr, flag_value)
VALUES
    (1000, 123.456, 1000, 0),
    (2, 234.567, 2, 1) ;


The NaN value would be represented by NULL, and "no reading" would be represented by the non-existence of the time/sensor_nr row.
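The three states of a field can then be told apart directly in SQL (the time window below is illustrative):

```sql
-- For a given sensor and time range:
--   row with non-NULL sensor_value  -> valid reading
--   row with NULL sensor_value      -> recorded but corrupted (NaN)
--   no row at all                   -> not recorded
SELECT timetag, sensor_nr, sensor_value
FROM tb
WHERE sensor_nr = 2
  AND timetag BETWEEN '2016-12-08 10:00' AND '2016-12-08 11:00';
```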

If you need frequent range SELECTs of different types, you would index by timetag, and also by sensor_nr and sensor_value. The insert costs would be relatively high, but the SELECTs could be fast.
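A minimal indexing setup for those access patterns might be the following (index names are assumptions, not from the original answer):

```sql
-- Range queries on the date:
CREATE INDEX tb_timetag_idx ON tb (timetag);

-- Range queries on the actual numbers, per sensor:
CREATE INDEX tb_sensor_idx  ON tb (sensor_nr, sensor_value);

-- Example query served by tb_sensor_idx:
SELECT timetag, sensor_value
FROM tb
WHERE sensor_nr = 2
  AND sensor_value BETWEEN 100.0 AND 200.0;
```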

If the "new unexpected fields that pop up at any time" are of different types (i.e. you have some co-ordinate pairs (float, float) and not just simple floats), this approach won't be flexible enough. In that case, JSON(B) with the newer index types is probably your best alternative, at the cost of losing (some) type safety.
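A sketch of that JSONB alternative, assuming PostgreSQL (table, column, and index names here are illustrative):

```sql
CREATE TABLE readings (
    id      serial PRIMARY KEY,
    timetag timestamp DEFAULT now(),
    fields  jsonb              -- e.g. {"sensor2": 163.6151345, "flag1": 1}
);

-- A GIN index supports containment and key-existence queries
-- on arbitrary keys, without knowing the field names in advance:
CREATE INDEX readings_fields_idx ON readings USING gin (fields);

-- New, unexpected fields need no schema change:
INSERT INTO readings (fields)
VALUES ('{"sensor1": 1.2323523, "lon": -3.70, "lat": 40.42}');
```

The trade-off noted above applies: every value is now a JSON scalar, so the database no longer enforces that a given field is always a float or always an integer.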

Context

StackExchange Database Administrators Q#157538, answer score: 3
