HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

How does the XML_COMPRESSION option work?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
theoptionworkdoeshowxml_compression

Problem

XML_COMPRESSION has recently gone GA in Azure SQL Database

I've been trying to find some details about how it works to understand the pros and cons and so far not found any specifics.

Trying the below on SQL Server 2022 ...

CREATE TABLE [dbo].[XmlCompressionTest](
Id INT IDENTITY CONSTRAINT PK_XmlCompressionTest PRIMARY KEY  WITH (XML_COMPRESSION = ON),
Marker VARBINARY(6),
TestXml XML,
Compressed varbinary(MAX)
);

--https://codebeautify.org/generate-random-xml
DECLARE @XmlAsString NVARCHAR(MAX) = N'
  zoo
  -1069239245.2026982
  gather
  
    
      -1618074814
      2070665023
      six
      join
      -920111498.4217186
      176219184.84385443
    
    floating
    -2102390055.351131
    better
    amount
    aid
  
  you
  -82882542.41709375
';

INSERT [dbo].[XmlCompressionTest]
VALUES (0x7E7E7E7E7E7E, @XmlAsString, COMPRESS(@XmlAsString))

SELECT *, sys.fn_PhysLocFormatter(%%physloc%%)
FROM [dbo].[XmlCompressionTest]


And then running DBCC PAGE on the resultant page I see the following for the three variable length columns

DBCC PAGE reports that "TestXml" has a Length of 730 but physical length of 644, whereas the "Compressed" column has a length of 420. So COMPRESS clearly wins on the basis of raw compression.

I see that the "TestXml" column (purple highlighted) appears to have some sort of header info with quite a few 0x00 - I assume this is potentially some sort of structure to speed up operations against the XML?

Anyone know the details of XML_COMPRESSION and its pros and cons vs

  • No compression (i.e. are we trading off much here for the lower storage footprint?)



  • Manual compression



?

Solution

It was confirmed on the Azure announcements post that it uses Xpress Compression Algorithm AKA LZXpress.

The documentation states

This algorithm efficiently compresses data that contain repeated byte
sequences. It is not designed to compress image, audio, or video data.
Between the trade-offs of compressed size and CPU cost, it heavily
emphasizes low CPU cost. Source

and

A protocol that depends on this algorithm would typically need to
transfer significant amounts of data that cannot be easily
precompressed by another algorithm having a better compression ratio. Source

This has a few variants.

DBCC PAGE with option 3 shows the uncompressed hex so I was able to use that in a C# console app and experiment with different compression options. COMPRESSION_FORMAT_XPRESS_HUFF | COMPRESSION_ENGINE_MAXIMUM seemed to be a very close match.

using System.Runtime.InteropServices;

[DllImport("ntdll.dll")]
static extern uint RtlGetCompressionWorkSpaceSize(ushort CompressionFormat, out uint pNeededBufferSize, out uint Unknown);

[DllImport("ntdll.dll")]
static extern uint RtlCompressBuffer(ushort CompressionFormat, byte[] SourceBuffer, int SourceBufferLength, byte[] DestinationBuffer,
    int DestinationBufferLength, uint Unknown, out int pDestinationSize, IntPtr WorkspaceBuffer);

[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
static extern IntPtr LocalAlloc(int uFlags, IntPtr sizetdwBytes);

[DllImport("kernel32.dll", SetLastError = true)]
static extern IntPtr LocalFree(IntPtr hMem);

const uint STATUS_SUCCESS = 0x00000000;
const uint STATUS_BUFFER_ALL_ZEROS = 0x00000117;
const uint STATUS_INVALID_PARAMETER = 0xC000000D;
const uint STATUS_UNSUPPORTED_COMPRESSION = 0xC000025F;
const uint STATUS_NOT_SUPPORTED = 0xC00000BB;
const uint STATUS_BUFFER_TOO_SMALL = 0xC0000023;

const ushort COMPRESSION_FORMAT_LZNT1 = 2;
const ushort COMPRESSION_FORMAT_XPRESS = 0x0003;
const ushort COMPRESSION_FORMAT_XPRESS_HUFF = 0x0004;

const ushort COMPRESSION_ENGINE_STANDARD = 0x0000;
const ushort COMPRESSION_ENGINE_MAXIMUM = 0x100;
const ushort COMPRESSION_ENGINE_HIBER = 0x0200;

var sourceHex =
    "dfff01b004f00472006f006f007400ef000001f801f00961007600610069006c00610062006c006500ef000002f80211037a006f006f00f7f0056600" +
    "6c0061006d006500ef000003f80311132d0031003000360039003200330039003200340035002e003200300032003600390038003200f7f0066c0069" +
    "00760069006e006700ef000004f8041106670061007400680065007200f7f00b6700720061006e006400660061007400680065007200ef000005f805" +
    "f004740065006c006c00ef000006f806f0036d0061007000ef000007f807110b2d003100360031003800300037003400380031003400f7f00274006f" +
    "00ef000008f808110a3200300037003000360036003500300032003300f7f00661006d006f0075006e007400ef000009f8091103730069007800f7f0" +
    "04770065006c006c00ef00000af80a11046a006f0069006e00f7f00477006f0072006500ef00000bf80b11122d003900320030003100310031003400" +
    "390038002e003400320031003700310038003600f7f0087100750065007300740069006f006e00ef00000cf80c111231003700360032003100390031" +
    "00380034002e0038003400330038003500340034003300f7f7f0047200610069006e00ef00000df80d110866006c006f006100740069006e006700f7" +
    "f00763006f006d00700061006e007900ef00000ef80e11122d0032003100300032003300390030003000350035002e00330035003100310033003100" +
    "f7f00470006f006c006500ef00000ff80f1106620065007400740065007200f7f00566007200750069007400ef000010f810110661006d006f007500" +
    "6e007400f7f0067700650061006c0074006800ef000011f8111103610069006400f7f7f0076200720065006100740068006500ef000012f812110379" +
    "006f007500f7f00869006e00630072006500610073006500ef000013f81311122d00380032003800380032003500340032002e003400310037003000" +
    "3900330037003500f7f7";

var sourceBytes = Convert.FromHexString(sourceHex);

var compressed = Compress(sourceBytes, COMPRESSION_FORMAT_XPRESS_HUFF | COMPRESSION_ENGINE_MAXIMUM);

Console.WriteLine("COMPRESSION_FORMAT_XPRESS_HUFF | COMPRESSION_ENGINE_MAXIMUM");
Console.WriteLine("");
Console.WriteLine(Convert.ToHexString(compressed));

static byte[] Compress(byte[] buffer, ushort compressionFormat)
{
    var outBuf = new byte[buffer.Length * 6];

    uint ret = RtlGetCompressionWorkSpaceSize(compressionFormat, out var dwSize, out _);
    if (ret != 0)
    {
        return null;
    }

    IntPtr hWork = LocalAlloc(0, new IntPtr(dwSize));
    ret = RtlCompressBuffer(compressionFormat, buffer,
        buffer.Length, outBuf, outBuf.Length, 0, out var dstSize, hWork);
    if (ret != 0)
    {
        return null;
    }

    LocalFree(hWork);

    Array.Resize(ref outBuf, dstSize);
    return outBuf;
}


The first 24 bytes in the column value shown in the question are additional to this output (i.e. the e8e8010078020000da0200006c020000da02000018000000)

The remainder of the column value is very similar to the output of the C# above (same length and identical until a few bytes near the end)

"The first 256 bytes indicate the bit length of each of the 512 Huffman symbols" - so that exp

Code Snippets

using System.Runtime.InteropServices;

[DllImport("ntdll.dll")]
static extern uint RtlGetCompressionWorkSpaceSize(ushort CompressionFormat, out uint pNeededBufferSize, out uint Unknown);

[DllImport("ntdll.dll")]
static extern uint RtlCompressBuffer(ushort CompressionFormat, byte[] SourceBuffer, int SourceBufferLength, byte[] DestinationBuffer,
    int DestinationBufferLength, uint Unknown, out int pDestinationSize, IntPtr WorkspaceBuffer);

[DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
static extern IntPtr LocalAlloc(int uFlags, IntPtr sizetdwBytes);

[DllImport("kernel32.dll", SetLastError = true)]
static extern IntPtr LocalFree(IntPtr hMem);

const uint STATUS_SUCCESS = 0x00000000;
const uint STATUS_BUFFER_ALL_ZEROS = 0x00000117;
const uint STATUS_INVALID_PARAMETER = 0xC000000D;
const uint STATUS_UNSUPPORTED_COMPRESSION = 0xC000025F;
const uint STATUS_NOT_SUPPORTED = 0xC00000BB;
const uint STATUS_BUFFER_TOO_SMALL = 0xC0000023;

const ushort COMPRESSION_FORMAT_LZNT1 = 2;
const ushort COMPRESSION_FORMAT_XPRESS = 0x0003;
const ushort COMPRESSION_FORMAT_XPRESS_HUFF = 0x0004;

const ushort COMPRESSION_ENGINE_STANDARD = 0x0000;
const ushort COMPRESSION_ENGINE_MAXIMUM = 0x100;
const ushort COMPRESSION_ENGINE_HIBER = 0x0200;

var sourceHex =
    "dfff01b004f00472006f006f007400ef000001f801f00961007600610069006c00610062006c006500ef000002f80211037a006f006f00f7f0056600" +
    "6c0061006d006500ef000003f80311132d0031003000360039003200330039003200340035002e003200300032003600390038003200f7f0066c0069" +
    "00760069006e006700ef000004f8041106670061007400680065007200f7f00b6700720061006e006400660061007400680065007200ef000005f805" +
    "f004740065006c006c00ef000006f806f0036d0061007000ef000007f807110b2d003100360031003800300037003400380031003400f7f00274006f" +
    "00ef000008f808110a3200300037003000360036003500300032003300f7f00661006d006f0075006e007400ef000009f8091103730069007800f7f0" +
    "04770065006c006c00ef00000af80a11046a006f0069006e00f7f00477006f0072006500ef00000bf80b11122d003900320030003100310031003400" +
    "390038002e003400320031003700310038003600f7f0087100750065007300740069006f006e00ef00000cf80c111231003700360032003100390031" +
    "00380034002e0038003400330038003500340034003300f7f7f0047200610069006e00ef00000df80d110866006c006f006100740069006e006700f7" +
    "f00763006f006d00700061006e007900ef00000ef80e11122d0032003100300032003300390030003000350035002e00330035003100310033003100" +
    "f7f00470006f006c006500ef00000ff80f1106620065007400740065007200f7f00566007200750069007400ef000010f810110661006d006f007500" +
    "6e007400f7f0067700650061006c0074006800ef000011f8111103610069006400f7f7f0076200720065006100740068006500ef000012f812110379" +
    "006f007500f7f00869006e00630072006500610073006500ef000013f81311122d00380032003800380032003500340032002e003400310037003000" +
    "3900330037003500f7f7";

var sourceBytes = Convert.FromHexString(sourceHex);

var compressed = Compress(sourceBytes, COMPRESSION_FORMAT_XPRESS_HUFF | COMPRESSION_EN
74677676777788775777000000000000000000000000700786677776770000000000000000000000000000000000000070777777780877778777868778080000000000000000000000000000000000000000000000000000080000000000000000000000000000000000000000000080000000000000008006000060040000800800000000000000870000000000000085070000000000007500000000000000778700000800000075870800000000008508008000000000568700000000000067768800000800008800000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
DECLARE @T TABLE
(
Id INT IDENTITY PRIMARY KEY  WITH (XML_COMPRESSION = ON),
TestXml XML
)

DECLARE @Xml XML = REPLICATE(N'<a>a</a>',41)
SELECT DATALENGTH(@Xml)

INSERT @T
VALUES (@Xml)

DECLARE @DynSQL NVARCHAR(MAX) = 
(
SELECT  DbccCommand = CONCAT('DBCC TRACEON(3604);DBCC PAGE (2, ', file_id ,', ', page_id ,', 3);DBCC TRACEOFF(3604)')
FROM @T
cross apply sys.fn_PhysLocCracker(%%physloc%%)
)

EXEC (@DynSQL);

Context

StackExchange Database Administrators Q#329939, answer score: 4

Revisions (0)

No revisions yet.