patternsqlMinor
How large is SQLites minimum data storage overhead and what generates them?
Viewed 0 times
generateswhatandminimumstoragelargethemhowoverheaddata
Problem
Motivation:
In my company there exists the (probably strange) idea to store pure time series data in a normal SQL table (instead of using BLOBs or just binary files which would probably be a better idea). I have to prove this is a misconception (at least for our embedded device where some kind of meta data overhead is not possible, since data storage is limited).
What I tried:
I'm using SQLite. Here is the test table's DDL:
I believe there is no PK overhead and no index overhead. However, there is the B-Tree "overhead" and some (probably negligible) metadata "overhead".
I made some tests with the database schema above and it seemed the overhead is about 40 % of the user data (e.g. 200 MB of pure data results in 276 MB file size). I tried different amounts of data; insert time and data overhead seemed to be relatively constant.
40% seems very high - but I have nothing else than the DLL which creates the table, and a simple insert in a for loop creating the data.
I didn't examine speed yet, since if data storage overhead is a show stopper my "proof of misconception" is already successful.
My question:
Is 40% (in my example table structure) the inevitable overhead or is it possible to considerably reduce it to less than 5% overhead? What components actually cause this overhead?
I'm afraid this is already answered in sqlite fileformat doc, however I couldn't discover the relevant information there.
Output (extract) of sqlite3_analyzer.exe with 4 rows of float64 entries
```
All tables ***
Percentage of total database...................... 100.0%
Number of entries................................. 6253011
Bytes of storage consumed....
In my company there exists the (probably strange) idea to store pure time series data in a normal SQL table (instead of using BLOBs or just binary files which would probably be a better idea). I have to prove this is a misconception (at least for our embedded device where some kind of meta data overhead is not possible, since data storage is limited).
What I tried:
I'm using SQLite. Here is the test table's DDL:
CREATE TABLE DataTable(signal_0 FLOAT,
signal_1 FLOAT,
signal_2 FLOAT,
signal_3 FLOAT);I believe there is no PK overhead and no index overhead. However, there is the B-Tree "overhead" and some (probably negligible) metadata "overhead".
I made some tests with the database schema above and it seemed the overhead is about 40 % of the user data (e.g. 200 MB of pure data results in 276 MB file size). I tried different amounts of data; insert time and data overhead seemed to be relatively constant.
40% seems very high - but I have nothing else than the DLL which creates the table, and a simple insert in a for loop creating the data.
select count() from table gives me 6253010. This means 62530104*8 Bytes = 200 MB user data.I didn't examine speed yet, since if data storage overhead is a show stopper my "proof of misconception" is already successful.
My question:
Is 40% (in my example table structure) the inevitable overhead or is it possible to considerably reduce it to less than 5% overhead? What components actually cause this overhead?
I'm afraid this is already answered in sqlite fileformat doc, however I couldn't discover the relevant information there.
Output (extract) of sqlite3_analyzer.exe with 4 rows of float64 entries
```
All tables ***
Percentage of total database...................... 100.0%
Number of entries................................. 6253011
Bytes of storage consumed....
Solution
Thanks for posting the output from sqlite3_analyzer. The pertinent bits are:
Table DATAMONOTABLE
Percentage of total database...................... 99.999%
Number of entries................................. 6253010
Bytes of storage consumed......................... 276185088
Bytes of payload.................................. 231361370 83.8%
Bytes of metadata................................. 42847765 15.5%
B-tree depth...................................... 3
Average payload per entry......................... 37.00
Average unused bytes per entry.................... 0.32
Average metadata per entry........................ 6.85
The report says metadata is 15.5% of the total storage consumed for the table in question. Every row of data consists of 4 x floating-point values which consume 8 bytes each for a total of 32 bytes. Doing the math, this indicates SQLite is consuming 4.96 bytes per row for metadata. According to section 2.3 of SQLite File Format document, each ordinary table is a "b-tree table":
2.3. Representation Of SQL Tables
Each ordinary SQL table in the database schema is represented on-disk by a table b-tree. Each entry in the table b-tree corresponds to a row of the SQL table. The rowid of the SQL table is the 64-bit signed integer key for each entry in the table b-tree.
The content of each SQL table row is stored in the database file by first combining the values in the various columns into a byte array in the record format, then storing that byte array as the payload in an entry in the table b-tree. The order of values in the record is the same as the order of columns in the SQL table definition. When an SQL table includes an INTEGER PRIMARY KEY column (which aliases the rowid) then that column appears in the record as a NULL value. SQLite will always use the table b-tree key rather than the NULL value when referencing the INTEGER PRIMARY KEY column.
As mentioned above, each row has a rowid that is a 64-bit signed integer, this is pretty clearly 4 bytes per row, leaving only 0.96 bytes per row for other overhead including page headers, etc. Larger rows would consume less overhead per-row since the rowid is always a 64-bit integer.
Having said all that, size-on-disk is only one aspect to consider when storing data.
Implementing a DIY storage system is fraught with problems that have already been solved rather extensively by various database management systems. For instance, an ACID-compliant DBMS, such as SQL Server, PostgreSQL, or SQLite, will allow for guaranteed atomicity, consistency, isolation, and durability.
Locating a specific row will be extremely fast using a good DBMS, and aggregating data will be easy, fast, and extremely versatile.
I most often work with SQL Server, so I decided to create a quick mock-up of your data to see what the space requirements would be.
First, I'll create a new, blank database for testing.
Here, I'll create a table with 4 floating point values per row:
The table will have a clustered index, since that is the most commonly used data structure in SQL Server. Note this index is not defined as a UNIQUE index, so it will have a uniqueifier added automatically that will add 4 bytes-per-row.
Here I'm inserting exactly 6,253,011 rows into the table, as per your question. Every value inserted will be a cryptographically-generated random number.
Show the row-count:
Results:
╔══════════════════╗
║ (No column name) ║
╠══════════════════╣
║ 6253011 ║
╚══════════════════╝
Show the space used by the table:
```
DECLARE @Schema sysname;
DECLARE @Table sysname;
DECLARE @DSName sysname;
SET @Schema = 'dbo';
SET @Table = 'Floats';
SET @DSName = NULL; --we don't care which DataSpace the table resides in.
SELECT DataSpace = ds.name
, ObjectName = QUOTENAME(s.name) + '.' + QUOTENAME(o.name)
, IndexName = i.name
, IndexType = i.type_desc
, TotalMB = CONVERT(INT, total_pages * 8192E
Table DATAMONOTABLE
Percentage of total database...................... 99.999%
Number of entries................................. 6253010
Bytes of storage consumed......................... 276185088
Bytes of payload.................................. 231361370 83.8%
Bytes of metadata................................. 42847765 15.5%
B-tree depth...................................... 3
Average payload per entry......................... 37.00
Average unused bytes per entry.................... 0.32
Average metadata per entry........................ 6.85
The report says metadata is 15.5% of the total storage consumed for the table in question. Every row of data consists of 4 x floating-point values which consume 8 bytes each for a total of 32 bytes. Doing the math, this indicates SQLite is consuming 4.96 bytes per row for metadata. According to section 2.3 of SQLite File Format document, each ordinary table is a "b-tree table":
2.3. Representation Of SQL Tables
Each ordinary SQL table in the database schema is represented on-disk by a table b-tree. Each entry in the table b-tree corresponds to a row of the SQL table. The rowid of the SQL table is the 64-bit signed integer key for each entry in the table b-tree.
The content of each SQL table row is stored in the database file by first combining the values in the various columns into a byte array in the record format, then storing that byte array as the payload in an entry in the table b-tree. The order of values in the record is the same as the order of columns in the SQL table definition. When an SQL table includes an INTEGER PRIMARY KEY column (which aliases the rowid) then that column appears in the record as a NULL value. SQLite will always use the table b-tree key rather than the NULL value when referencing the INTEGER PRIMARY KEY column.
As mentioned above, each row has a rowid that is a 64-bit signed integer, this is pretty clearly 4 bytes per row, leaving only 0.96 bytes per row for other overhead including page headers, etc. Larger rows would consume less overhead per-row since the rowid is always a 64-bit integer.
Having said all that, size-on-disk is only one aspect to consider when storing data.
Implementing a DIY storage system is fraught with problems that have already been solved rather extensively by various database management systems. For instance, an ACID-compliant DBMS, such as SQL Server, PostgreSQL, or SQLite, will allow for guaranteed atomicity, consistency, isolation, and durability.
Locating a specific row will be extremely fast using a good DBMS, and aggregating data will be easy, fast, and extremely versatile.
I most often work with SQL Server, so I decided to create a quick mock-up of your data to see what the space requirements would be.
First, I'll create a new, blank database for testing.
USE master;
GO
IF EXISTS (SELECT 1 FROM sys.databases d WHERE d.name = 'SizeCalc')
BEGIN
ALTER DATABASE SizeCalc SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE SizeCalc;
END
CREATE DATABASE SizeCalc;
GOHere, I'll create a table with 4 floating point values per row:
USE SizeCalc;
CREATE TABLE dbo.Floats
(
x1 float NULL
, x2 float NULL
, x3 float NULL
, x4 float NULL
);The table will have a clustered index, since that is the most commonly used data structure in SQL Server. Note this index is not defined as a UNIQUE index, so it will have a uniqueifier added automatically that will add 4 bytes-per-row.
CREATE CLUSTERED INDEX cx_float
ON dbo.Floats(x1);Here I'm inserting exactly 6,253,011 rows into the table, as per your question. Every value inserted will be a cryptographically-generated random number.
INSERT INTO dbo.Floats (x1, x2, x3, x4)
SELECT TOP(6253011)
CONVERT(bigint, CRYPT_GEN_RANDOM(8)) / CONVERT(float, CONVERT(bigint, CRYPT_GEN_RANDOM(2)) + 0.000001)
, CONVERT(bigint, CRYPT_GEN_RANDOM(8)) / CONVERT(float, CONVERT(bigint, CRYPT_GEN_RANDOM(2)) + 0.000003)
, CONVERT(bigint, CRYPT_GEN_RANDOM(8)) / CONVERT(float, CONVERT(bigint, CRYPT_GEN_RANDOM(2)) + 0.000005)
, CONVERT(bigint, CRYPT_GEN_RANDOM(8)) / CONVERT(float, CONVERT(bigint, CRYPT_GEN_RANDOM(2)) + 0.000007)
FROM sys.syscolumns sc1
CROSS JOIN sys.syscolumns sc2
CROSS JOIN sys.syscolumns sc3
GOShow the row-count:
SELECT COUNT(1)
FROM dbo.Floats;Results:
╔══════════════════╗
║ (No column name) ║
╠══════════════════╣
║ 6253011 ║
╚══════════════════╝
Show the space used by the table:
```
DECLARE @Schema sysname;
DECLARE @Table sysname;
DECLARE @DSName sysname;
SET @Schema = 'dbo';
SET @Table = 'Floats';
SET @DSName = NULL; --we don't care which DataSpace the table resides in.
SELECT DataSpace = ds.name
, ObjectName = QUOTENAME(s.name) + '.' + QUOTENAME(o.name)
, IndexName = i.name
, IndexType = i.type_desc
, TotalMB = CONVERT(INT, total_pages * 8192E
Code Snippets
USE master;
GO
IF EXISTS (SELECT 1 FROM sys.databases d WHERE d.name = 'SizeCalc')
BEGIN
ALTER DATABASE SizeCalc SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE SizeCalc;
END
CREATE DATABASE SizeCalc;
GOUSE SizeCalc;
CREATE TABLE dbo.Floats
(
x1 float NULL
, x2 float NULL
, x3 float NULL
, x4 float NULL
);CREATE CLUSTERED INDEX cx_float
ON dbo.Floats(x1);INSERT INTO dbo.Floats (x1, x2, x3, x4)
SELECT TOP(6253011)
CONVERT(bigint, CRYPT_GEN_RANDOM(8)) / CONVERT(float, CONVERT(bigint, CRYPT_GEN_RANDOM(2)) + 0.000001)
, CONVERT(bigint, CRYPT_GEN_RANDOM(8)) / CONVERT(float, CONVERT(bigint, CRYPT_GEN_RANDOM(2)) + 0.000003)
, CONVERT(bigint, CRYPT_GEN_RANDOM(8)) / CONVERT(float, CONVERT(bigint, CRYPT_GEN_RANDOM(2)) + 0.000005)
, CONVERT(bigint, CRYPT_GEN_RANDOM(8)) / CONVERT(float, CONVERT(bigint, CRYPT_GEN_RANDOM(2)) + 0.000007)
FROM sys.syscolumns sc1
CROSS JOIN sys.syscolumns sc2
CROSS JOIN sys.syscolumns sc3
GOSELECT COUNT(1)
FROM dbo.Floats;Context
StackExchange Database Administrators Q#214300, answer score: 2
Revisions (0)
No revisions yet.