snippetsqlMinor
How can I efficiently traverse graph data with this pattern?
Viewed 0 times
thiscanefficientlygraphwithtraversehowdatapattern
Problem
I have some relations embodying a directed acyclic graph that includes patterns similar to the following:
I'm looking for an efficient way to traverse this graph data. Here is an example of the seemingly simple task of counting descendants of node 0:
db<>fiddle
This results in the following:
Every path is visited instead of each node once
The obvious strategy is to only visit previously unvisited nodes on each iteration.
Excluding redundant additions using
The most direct method of preventing visits to previously visited nodes would seem to be to filter right in the recursive member of the CTE as follows:
This produces the error "Recursive member
I'm looking for an efficient way to traverse this graph data. Here is an example of the seemingly simple task of counting descendants of node 0:
db<>fiddle
DROP TABLE IF EXISTS #edges;
CREATE TABLE #edges(tail int, head int);
INSERT INTO #edges(tail,head) VALUES
(0,1), (5, 6), (10,11), (15,16),
(0,2), (5, 7), (10,12), (15,17),
(1,2), (6, 7), (11,12), (16,17),
(1,3), (7, 8), (11,13), (17,18),
(2,3), (7, 9), (12,13), (17,19),
(2,4), (8, 9), (12,14), (18,19),
(3,4), (8,10), (13,14),
(3,5), (9,10), (13,15),
(4,5), (9,11), (14,15),
(4,6), (14,16);
WITH descendents(node)
AS(
SELECT 0 as node
UNION ALL
SELECT head as node FROM descendents as prior
JOIN #edges ON prior.node = tail
)
SELECT
(SELECT COUNT(node) FROM descendents) as total_nodes,
(SELECT COUNT(node) FROM
(SELECT DISTINCT node FROM descendents) as d) as distinct_nodesThis results in the following:
total_nodes | distinct_nodes
10512 | 20Every path is visited instead of each node once
total_nodes seems to grow at about 2^n where n is the number of nodes in the example. This is because every possible path is traversed rather than each node once. This n=29 example results in 1,305,729 total_nodes and took 75 seconds to complete on my local instance of SQL Server Express.The obvious strategy is to only visit previously unvisited nodes on each iteration.
Excluding redundant additions using
WHERE...NOT IN does not seem to be supportedThe most direct method of preventing visits to previously visited nodes would seem to be to filter right in the recursive member of the CTE as follows:
SELECT head as node FROM descendents as prior
JOIN #edges ON prior.node = tail
WHERE head NOT IN (SELECT node from descendents)This produces the error "Recursive member
Solution
The problem really is, your graph image is not exactly one to one with your example
Is the current implementation unacceptably inefficient? Just because it's traversing duplicate nodes doesn't necessarily mean the process to get there was inefficient. What does the execution plan show happened under the hood? What does the
For cyclic graphs, if you wanted to visit each node only once, then you can workaround the redundancy issue by keeping track of nodes that were already visited, on each iteration of the recursion. One way to do that is by building a string of already visited nodes with a separator and doing a wildcard contains search, like such:
But this methodology can be inefficient in itself because there is overhead to doing the wildcard contains search in the
A slightly alternative option (assuming you're on SQL Server 2016 or newer) is to build the string of visited nodes, then use
Unfortunately, neither of these solutions apply to your problem with your acyclic graph, because your problem isn't with cyclic descendent chains, rather it's siblings with duplicate children (as discussed in the comments).
At the end of the day, you're going to want to look at the execution plan and runtime statistics, as previously mentioned, to determine how performant each implementation is, and compare them.
The stored procedure route will likely be less efficient than your recursive CTE implementation. It's more of a procedural code approach as opposed to a relational one. I've converted recursive stored procedures that took over 1 hour to compute down to under 1 second by implementing them as recursive CTEs.
The
Using the SQL Graph feature would be interesting. I only have minimal experience using that as well. But again, I think it's a much less utilized feature of SQL Server, and so there's probably less information out there on how to properly use it efficiently.
Not that I can think of at the moment. The most common solution is a recursive CTE, generally.
#edges data. In your graph image, node 2 is unique. In the #edges table, node 2 exists twice because it is a child of node 0 and node 1. So this actually implicitly ends up breaking your graph into a tree instead, with node 2 duplicated.- Is there a more efficient way to traverse the graph in the above example using recursive CTEs?
Is the current implementation unacceptably inefficient? Just because it's traversing duplicate nodes doesn't necessarily mean the process to get there was inefficient. What does the execution plan show happened under the hood? What does the
TIME and IO STATISTICS say in regards to runtime computation? I've generally found recursive CTEs to be rather performant even when iterating over decently sized collections of data.For cyclic graphs, if you wanted to visit each node only once, then you can workaround the redundancy issue by keeping track of nodes that were already visited, on each iteration of the recursion. One way to do that is by building a string of already visited nodes with a separator and doing a wildcard contains search, like such:
WITH descendents(node)
AS(
SELECT 0 as node, CONVERT(VARCHAR(MAX), '0') AS NodesVisited
UNION ALL
SELECT head as node, CONCAT(prior.NodesVisited, '|', head) AS NodesVisited
FROM descendents as prior
JOIN #edges ON prior.node = tail
WHERE prior.NodesVisited NOT LIKE CONCAT('%|', #edges.head, '|%')
)
...But this methodology can be inefficient in itself because there is overhead to doing the wildcard contains search in the
WHERE clause on each iteration. This coupled with the NOT LIKE clause also makes the predicate non-sargable which effectively means it can't take use of an index to efficiently serve that predicate. Depending on the size of the #edges data, that could also impact the performance negatively.A slightly alternative option (assuming you're on SQL Server 2016 or newer) is to build the string of visited nodes, then use
CROSS APPLY with the built-in STRING_SPLIT() function to join to the collection of visited nodes as a dataset, as opposed to using a wildcard contains search. This too can be resource intensive though (from the repeated calls to the STRING_SPLIT() function).Unfortunately, neither of these solutions apply to your problem with your acyclic graph, because your problem isn't with cyclic descendent chains, rather it's siblings with duplicate children (as discussed in the comments).
At the end of the day, you're going to want to look at the execution plan and runtime statistics, as previously mentioned, to determine how performant each implementation is, and compare them.
- stored procedure that implements recursion programmatically
- hierarchyid
- SQL Graph
- Of the above alternatives, what are the tradeoffs?
The stored procedure route will likely be less efficient than your recursive CTE implementation. It's more of a procedural code approach as opposed to a relational one. I've converted recursive stored procedures that took over 1 hour to compute down to under 1 second by implementing them as recursive CTEs.
The
hierarchyid data type I have no experience with, but I would guess you would still need to implement a recursive-based solution to get the end results you're after, either way. I don't believe this data type is commonly used.Using the SQL Graph feature would be interesting. I only have minimal experience using that as well. But again, I think it's a much less utilized feature of SQL Server, and so there's probably less information out there on how to properly use it efficiently.
- Are there any other alternatives I should consider?
Not that I can think of at the moment. The most common solution is a recursive CTE, generally.
Code Snippets
WITH descendents(node)
AS(
SELECT 0 as node, CONVERT(VARCHAR(MAX), '0') AS NodesVisited
UNION ALL
SELECT head as node, CONCAT(prior.NodesVisited, '|', head) AS NodesVisited
FROM descendents as prior
JOIN #edges ON prior.node = tail
WHERE prior.NodesVisited NOT LIKE CONCAT('%|', #edges.head, '|%')
)
...Context
StackExchange Database Administrators Q#327305, answer score: 2
Revisions (0)
No revisions yet.