HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlModerate

Are duplicated NVARCHAR values stored as copies in SQL Server?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
storednvarcharcopiesaresqlserverduplicatedvalues

Problem

I'm designing a table that will contain a lot of rows. So need to be careful not to store to much information. One of the columns is a NVARCHAR(MAX) column and it contains the address of our customers. As addresses do not change often, this column will contain many repeated values and thus contains quite some redundancy.

So I was wondering if I need to normalize this myself by maintaining some sort of look-up table to address strings (note that if an address changes I need to maintain history - so it's not a matter of usual normalization), or if SQL Server is pointing to the same reference of the string behind the scenes. Or maybe it offers a column option to do so. Another approach that came into my mind is to use COMPRESS but I guess this does not make sense as the data itself (i.e. the address) is not long.

Reading/writing performance is not so much of a concern as the data will be accumulated over time.

Solution

Under "normal" conditions, no, data in VARCHAR and NVARCHAR columns is not de-duped (although duplicate attribute and/or element names in a single XML value are reduced to a unique instance).

Using one of the Data Compression options is probably your best bet. Here are some things to consider:

  • Unicode Compression (part of Row Compression) only works on NVARCHAR(1 - 4000), not NVARCHAR(MAX) (please vote for / support: Unicode compression NVARCHAR(MAX)).



  • Page Compression can work with NVARCHAR(MAX), but only for in-row data. Off-row data (LOB pages) is not compressed.



Since the data won't really be changing, you should look into the Columnstore Index options (also available in Azure SQL Database):

  • Columnstore indexes: Overview



  • Columnstore indexes - Design guidance



Columnstore compression should be better than Page compression.

 

Also, you should probably avoid using the SPARSE option, due to:

  • The SPARSE option offers no benefit compared to Data Compression.



  • It is mainly intended for columns sets / wide tables (i.e. up to 30,000 columns).



  • It mostly helps with fixed-length datatypes (e.g. INT, DATETIME). So, prior to Data Compression being available, it was useful for CHAR and NCHAR, but not for VARCHAR or NVARCHAR as they don't take up space when NULL.



  • It only benefits columns set to NULL.



  • It slightly hurts non-NULL values by adding 2 bytes to each one.



  • You should probably have NULL for 50% - 60% of the rows in order to get enough savings for it to be worth using this option.



For more details on working with character data, please see the following post of mine:

How Many Bytes Per Character in SQL Server: a Completely Complete Guide

Context

StackExchange Database Administrators Q#279876, answer score: 12

Revisions (0)

No revisions yet.