HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Redshift DISTSTYLE ALL table takes less disk space?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
spacediskallredshiftlessdiststyletakestable

Problem

(Self-migrating from StackOverflow.)

We have a fat table of exactly 591 columns in Redshift, distributed by a key.

We were unaware that fat tables chew up space - 2 blocks per column per slice, minimum, and we aren't close to filling up the first block on any slice.

I misunderstood a coworker to recommend using "DISTSTYLE ALL." That seemed maddeningly counter-intuitive, but I set about trying this, and got a surprise that the disk usage of the new table, with the same rows and columns, is actually 1/20th of the size of the original table.

I'm using this query to get the sizes, but Aginity agrees when reporting disk usage for these tables:

select trim(pgdb.datname) as Database,
trim(a.name) as Table,
b.mbytes
from stv_tbl_perm a
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes,
count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where a.slice=0 and pgdb.datname = 'my_schema'
AND a.name IN ('my_table',
'my_table_distall')
order by Database, Name;


Yielding this output:

Schema Table mbytes
my_schema my_table 222080
my_schema my_table_distall 10740


The reason I mentioned the exact number of columns is that 222080=5945322, which means we use two blocks for every 532 slices for every column (including Redshift's three hidden columns.)

The new table, with DISTSTYLE ALL, is taking up an average of 3.6 block per column per node.

This is very confusing. Why is my new table smaller by a factor of 20?

Due to the size of our data, I actually didn't expect it to get bigger, because even with all the data, we would not take up more than one block per slice per column, but smaller seems insane.

Is it just as simple as "DISTSTYLE ALL duplicates the table to one slice on each node?" So

Solution

Well, after lots of experimentation, we have found that, yes, if you have a fat table with not enough rows to fill much of a block at the slice level using DISTSTYLE KEY, it is much more efficient, space-wise, to use DISTSTYLE ALL, even though this replicates more data.

For example, our fat tables have, say, 600000 rows and 1000 columns. Then, on a 5-node 32-slice setup, they were using up 1Mb per column per node per slice, or 160 Gb, and we were only filling less than 1% of that space.

DISTSTYLE ALL appears to not use all the slices.

So using DISTSTYLE ALL, we multiplied the data by 5, but this still takes up less than 1/20th of the space that the DISTSTYLE KEY tables did.

It slows down our queries a little (due to the need for IPC between the slices, and because the DISTSTYLE KEY means we have only the data on the node that matches our other data) but, since these tables are loaded once and never change after, we don't suffer from the other problems of DISTSTYLE ALL.

Context

StackExchange Database Administrators Q#128197, answer score: 4

Revisions (0)

No revisions yet.