HiveBrain v1.2.0
Get Started
← Back to all entries
patternsqlMinor

Why is SQL doing an index scan and seek with the same XML query structure?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
whyseekthescansamesqlwithquerydoingxml

Problem

I've set up a test to benchmark xml performance in SQL server.

Test Setup

  • One Million rows of data



  • XML defined Column with primary key



  • A primary XML index



  • A secondary XML index on the path



  • XML data is similar in format but with variable tag names in each document



Table and Index Design

CREATE TABLE [dbo].[xml_Test]
(
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [GUID] [varchar](50) NULL,
    [JSON_Data] [varchar](max) NULL,
    [XML_Data] [xml] NULL,
    CONSTRAINT [PK_xml_Test] PRIMARY KEY CLUSTERED ([ID] ASC)
);

ALTER TABLE [dbo].[xml_Test] ADD  CONSTRAINT [DF_xml_Test_GUID]  DEFAULT (newid()) FOR [GUID];
ALTER TABLE [dbo].[xml_Test] ADD  CONSTRAINT [PK_xml_Test] PRIMARY KEY CLUSTERED ([ID] ASC);

CREATE PRIMARY XML INDEX [PK_xml] ON [dbo].[xml_Test]
(   [XML_Data]);

CREATE XML INDEX [IX_xml_Path] ON [dbo].[xml_Test] 
(   [XML_Data]) 
USING XML INDEX [PK_xml] FOR PATH;


Sample XML Schema


  3812
  E3735046-1183-4A79-B8EE-806312B533D6",
  John
  Doe
  123-123-1234
  Toronto
  Ontario
  14325
  14326
  143257
  14328
  14329
  14330
  14331
  14332
  14333
  1
  14335
  14336
  1
  1


When I go to query the table with two different values, I get a very efficient seek for one result and a scan for the other that takes orders of magnitude longer to complete.

I'm unsure as to why the second query is not able to do a seek on the index. I think it may have something to do with the element not being in the index to start with but even then I don't know why a scan would be necessary unless the index really didn't index all paths.

Execution Plans via Paste the Plan

Query A

select ID, 'Query A'
from xml_test
where XML_Data.exist('(/data/Q.70.R.4)[1]')=1


Query B

select id, 'Query B'
from xml_test
where XML_Data.exist('(/data/Q.61.R.15)[1]')=1


As you can see, Query A takes 5ms to execute compared to almost 15s for query B.

What are the reasons that the same query format but with different input would yield such an ob

Solution

XML index in SQL Server is implemented as an internal table that is a persisted version of the node table that is much the same as the XML shredding functions produce.

One of the columns in the internal table is called hid and that column contains the value used in the seek for a path expression. The value is an ordpath value. When you create a path xml index you get a non clustered index with hid as the leading column.

The ordpath for a specific element name is generated when SQL Server inserts XML data to the table. To encode the path, the different ordpath values are concatenated.

Simplified the path X/Y/Z could be represented as 1.2.3 and Y/Z/Z as 2.3.3.

SQL Server has to have a way to generate the same ordpath value for the same element name. That could be done with some kind of dictionary or table that keeps track of all generated ordpath values for one XML index. I'm a bit fuzzy here because I have not found anything anywhere that says this is how it is done and I have not found where it is stored.

Inserting one row with this XML


  
    
  


will give you this content in the internal table with one extra column added by me to show the path expression encoded in hid

id      hid     pk1
0x              1    
0x58          1     Z
0x5AC0  Á€À€    1     Y/Z
0x5AD6  €Á€À€  1     X/Y/Z


You can look at the internal table using a DAC login.

What happens when you query for a path expression (and here I get a bit speculative again) is that the same dictionary/lookup table that is used to make sure the elements with the same name get the same ordpath value is used to find the hid to use for the index seek in the internal table. The path expression in a XQuery is always static so the hid value used for the query can be stored in the query plan and resused.

A query for a node that exists get a query with a seek on the hid where Î is the hid value to search for.


  
    
      
    
    
      
        
      
    
  


A query for a node that does not exist gives you a very different way of locating the rows.

select count(*)
from dbo.T
where T.X.exist('NodeDoesNotExist') = 1


The compute scalar uses some functions to calculate values that are checked against in the filter operator.

Some of the tests I did to get here can be found here. Note that you don't actually need a generated hid for a element in the index to get the seek plan, you just have to try to insert it.

As a side note, the rewriting of the query to use one way or the other to find the rows of a XML query is done before the query optimizer starts to do its job so if you get a seek or not also depends on cardinality on the hid column in the internal table.

Paul White mentioned something in chat about where the element names inserted are stored:

QNames are stored in sys.sysqnames. The XML algebrizer requires an
existing known QName to generate an optimized path transform. i.e. to
use a path index. Entries are added to sys.sysqnames in a system
transaction that does not feel user transactions

Code Snippets

<X>
  <Y>
    <Z />
  </Y>
</X>
id      hid     pk1
0x              1    
0x58          1     Z
0x5AC0  Á€À€    1     Y/Z
0x5AD6  €Á€À€  1     X/Y/Z
<SeekKeys>
  <Prefix ScanType="EQ">
    <RangeColumns>
      <ColumnReference Database="[yy]" Schema="[sys]" Table="[xml_index_nodes_658101385_256000]" 
                       Alias="[NodeThatExist:1]" Column="hid" />
    </RangeColumns>
    <RangeExpressions>
      <ScalarOperator ScalarString="'Î'">
        <Const ConstValue="'Î'" />
      </ScalarOperator>
    </RangeExpressions>
  </Prefix>
</SeekKeys>
select count(*)
from dbo.T
where T.X.exist('NodeDoesNotExist') = 1

Context

StackExchange Database Administrators Q#288872, answer score: 9

Revisions (0)

No revisions yet.