patternsqlMinor
Why is SQL doing an index scan and seek with the same XML query structure?
Viewed 0 times
whyseekthescansamesqlwithquerydoingxml
Problem
I've set up a test to benchmark xml performance in SQL server.
Test Setup
Table and Index Design
Sample XML Schema
When I go to query the table with two different values, I get a very efficient seek for one result and a scan for the other that takes orders of magnitude longer to complete.
I'm unsure as to why the second query is not able to do a seek on the index. I think it may have something to do with the element not being in the index to start with but even then I don't know why a scan would be necessary unless the index really didn't index all paths.
Execution Plans via Paste the Plan
Query A
Query B
As you can see, Query A takes 5ms to execute compared to almost 15s for query B.
What are the reasons that the same query format but with different input would yield such an ob
Test Setup
- One Million rows of data
- XML defined Column with primary key
- A primary XML index
- A secondary XML index on the path
- XML data is similar in format but with variable tag names in each document
Table and Index Design
CREATE TABLE [dbo].[xml_Test]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[GUID] [varchar](50) NULL,
[JSON_Data] [varchar](max) NULL,
[XML_Data] [xml] NULL,
CONSTRAINT [PK_xml_Test] PRIMARY KEY CLUSTERED ([ID] ASC)
);
ALTER TABLE [dbo].[xml_Test] ADD CONSTRAINT [DF_xml_Test_GUID] DEFAULT (newid()) FOR [GUID];
ALTER TABLE [dbo].[xml_Test] ADD CONSTRAINT [PK_xml_Test] PRIMARY KEY CLUSTERED ([ID] ASC);
CREATE PRIMARY XML INDEX [PK_xml] ON [dbo].[xml_Test]
( [XML_Data]);
CREATE XML INDEX [IX_xml_Path] ON [dbo].[xml_Test]
( [XML_Data])
USING XML INDEX [PK_xml] FOR PATH;Sample XML Schema
3812
E3735046-1183-4A79-B8EE-806312B533D6",
John
Doe
123-123-1234
Toronto
Ontario
14325
14326
143257
14328
14329
14330
14331
14332
14333
1
14335
14336
1
1
When I go to query the table with two different values, I get a very efficient seek for one result and a scan for the other that takes orders of magnitude longer to complete.
I'm unsure as to why the second query is not able to do a seek on the index. I think it may have something to do with the element not being in the index to start with but even then I don't know why a scan would be necessary unless the index really didn't index all paths.
Execution Plans via Paste the Plan
Query A
select ID, 'Query A'
from xml_test
where XML_Data.exist('(/data/Q.70.R.4)[1]')=1Query B
select id, 'Query B'
from xml_test
where XML_Data.exist('(/data/Q.61.R.15)[1]')=1As you can see, Query A takes 5ms to execute compared to almost 15s for query B.
What are the reasons that the same query format but with different input would yield such an ob
Solution
XML index in SQL Server is implemented as an internal table that is a persisted version of the node table that is much the same as the XML shredding functions produce.
One of the columns in the internal table is called hid and that column contains the value used in the seek for a path expression. The value is an ordpath value. When you create a path xml index you get a non clustered index with hid as the leading column.
The ordpath for a specific element name is generated when SQL Server inserts XML data to the table. To encode the path, the different ordpath values are concatenated.
Simplified the path X/Y/Z could be represented as 1.2.3 and Y/Z/Z as 2.3.3.
SQL Server has to have a way to generate the same ordpath value for the same element name. That could be done with some kind of dictionary or table that keeps track of all generated ordpath values for one XML index. I'm a bit fuzzy here because I have not found anything anywhere that says this is how it is done and I have not found where it is stored.
Inserting one row with this XML
will give you this content in the internal table with one extra column added by me to show the path expression encoded in hid
You can look at the internal table using a DAC login.
What happens when you query for a path expression (and here I get a bit speculative again) is that the same dictionary/lookup table that is used to make sure the elements with the same name get the same ordpath value is used to find the hid to use for the index seek in the internal table. The path expression in a XQuery is always static so the hid value used for the query can be stored in the query plan and resused.
A query for a node that exists get a query with a seek on the hid where
A query for a node that does not exist gives you a very different way of locating the rows.
The compute scalar uses some functions to calculate values that are checked against in the filter operator.
Some of the tests I did to get here can be found here. Note that you don't actually need a generated hid for a element in the index to get the seek plan, you just have to try to insert it.
As a side note, the rewriting of the query to use one way or the other to find the rows of a XML query is done before the query optimizer starts to do its job so if you get a seek or not also depends on cardinality on the hid column in the internal table.
Paul White mentioned something in chat about where the element names inserted are stored:
QNames are stored in sys.sysqnames. The XML algebrizer requires an
existing known QName to generate an optimized path transform. i.e. to
use a path index. Entries are added to sys.sysqnames in a system
transaction that does not feel user transactions
One of the columns in the internal table is called hid and that column contains the value used in the seek for a path expression. The value is an ordpath value. When you create a path xml index you get a non clustered index with hid as the leading column.
The ordpath for a specific element name is generated when SQL Server inserts XML data to the table. To encode the path, the different ordpath values are concatenated.
Simplified the path X/Y/Z could be represented as 1.2.3 and Y/Z/Z as 2.3.3.
SQL Server has to have a way to generate the same ordpath value for the same element name. That could be done with some kind of dictionary or table that keeps track of all generated ordpath values for one XML index. I'm a bit fuzzy here because I have not found anything anywhere that says this is how it is done and I have not found where it is stored.
Inserting one row with this XML
will give you this content in the internal table with one extra column added by me to show the path expression encoded in hid
id hid pk1
0x 1
0x58  1 Z
0x5AC0 Á€À€ 1 Y/Z
0x5AD6 €Á€À€ 1 X/Y/ZYou can look at the internal table using a DAC login.
What happens when you query for a path expression (and here I get a bit speculative again) is that the same dictionary/lookup table that is used to make sure the elements with the same name get the same ordpath value is used to find the hid to use for the index seek in the internal table. The path expression in a XQuery is always static so the hid value used for the query can be stored in the query plan and resused.
A query for a node that exists get a query with a seek on the hid where
Î is the hid value to search for.
A query for a node that does not exist gives you a very different way of locating the rows.
select count(*)
from dbo.T
where T.X.exist('NodeDoesNotExist') = 1The compute scalar uses some functions to calculate values that are checked against in the filter operator.
Some of the tests I did to get here can be found here. Note that you don't actually need a generated hid for a element in the index to get the seek plan, you just have to try to insert it.
As a side note, the rewriting of the query to use one way or the other to find the rows of a XML query is done before the query optimizer starts to do its job so if you get a seek or not also depends on cardinality on the hid column in the internal table.
Paul White mentioned something in chat about where the element names inserted are stored:
QNames are stored in sys.sysqnames. The XML algebrizer requires an
existing known QName to generate an optimized path transform. i.e. to
use a path index. Entries are added to sys.sysqnames in a system
transaction that does not feel user transactions
Code Snippets
<X>
<Y>
<Z />
</Y>
</X>id hid pk1
0x 1
0x58  1 Z
0x5AC0 Á€À€ 1 Y/Z
0x5AD6 €Á€À€ 1 X/Y/Z<SeekKeys>
<Prefix ScanType="EQ">
<RangeColumns>
<ColumnReference Database="[yy]" Schema="[sys]" Table="[xml_index_nodes_658101385_256000]"
Alias="[NodeThatExist:1]" Column="hid" />
</RangeColumns>
<RangeExpressions>
<ScalarOperator ScalarString="'Î'">
<Const ConstValue="'Î'" />
</ScalarOperator>
</RangeExpressions>
</Prefix>
</SeekKeys>select count(*)
from dbo.T
where T.X.exist('NodeDoesNotExist') = 1Context
StackExchange Database Administrators Q#288872, answer score: 9
Revisions (0)
No revisions yet.