HiveBrain v1.2.0
Get Started
← Back to all entries
snippetsqlMinor

How can I efficiently traverse graph data with this pattern?

Submitted by: @import:stackexchange-dba··
0
Viewed 0 times
thiscanefficientlygraphwithtraversehowdatapattern

Problem

I have some relations embodying a directed acyclic graph that includes patterns similar to the following:

I'm looking for an efficient way to traverse this graph data. Here is an example of the seemingly simple task of counting descendants of node 0:

db<>fiddle

DROP TABLE IF EXISTS #edges;
CREATE TABLE #edges(tail int, head int);
INSERT INTO #edges(tail,head) VALUES
    (0,1), (5, 6),  (10,11), (15,16),
    (0,2), (5, 7),  (10,12), (15,17),
    (1,2), (6, 7),  (11,12), (16,17),
    (1,3), (7, 8),  (11,13), (17,18),
    (2,3), (7, 9),  (12,13), (17,19),
    (2,4), (8, 9),  (12,14), (18,19),
    (3,4), (8,10),  (13,14),
    (3,5), (9,10),  (13,15),
    (4,5), (9,11),  (14,15),
    (4,6),          (14,16);
WITH descendents(node)
AS(
    SELECT 0 as node
    UNION ALL
    SELECT head as node FROM descendents as prior
    JOIN #edges ON prior.node = tail
)
SELECT
    (SELECT COUNT(node) FROM descendents)             as total_nodes,
    (SELECT COUNT(node) FROM
        (SELECT DISTINCT node FROM descendents) as d) as distinct_nodes


This results in the following:

total_nodes | distinct_nodes
      10512 |             20


Every path is visited instead of each node once

total_nodes seems to grow at about 2^n where n is the number of nodes in the example. This is because every possible path is traversed rather than each node once. This n=29 example results in 1,305,729 total_nodes and took 75 seconds to complete on my local instance of SQL Server Express.

The obvious strategy is to only visit previously unvisited nodes on each iteration.
Excluding redundant additions using WHERE...NOT IN does not seem to be supported

The most direct method of preventing visits to previously visited nodes would seem to be to filter right in the recursive member of the CTE as follows:

SELECT head as node FROM descendents as prior
JOIN #edges ON prior.node = tail
WHERE head NOT IN (SELECT node from descendents)


This produces the error "Recursive member

Solution

The problem really is, your graph image is not exactly one to one with your example #edges data. In your graph image, node 2 is unique. In the #edges table, node 2 exists twice because it is a child of node 0 and node 1. So this actually implicitly ends up breaking your graph into a tree instead, with node 2 duplicated.

  • Is there a more efficient way to traverse the graph in the above example using recursive CTEs?



Is the current implementation unacceptably inefficient? Just because it's traversing duplicate nodes doesn't necessarily mean the process to get there was inefficient. What does the execution plan show happened under the hood? What does the TIME and IO STATISTICS say in regards to runtime computation? I've generally found recursive CTEs to be rather performant even when iterating over decently sized collections of data.

For cyclic graphs, if you wanted to visit each node only once, then you can workaround the redundancy issue by keeping track of nodes that were already visited, on each iteration of the recursion. One way to do that is by building a string of already visited nodes with a separator and doing a wildcard contains search, like such:

WITH descendents(node)
AS(
    SELECT 0 as node, CONVERT(VARCHAR(MAX), '0') AS NodesVisited
    UNION ALL
    SELECT head as node, CONCAT(prior.NodesVisited, '|', head) AS NodesVisited
    FROM descendents as prior
    JOIN #edges ON prior.node = tail
    WHERE prior.NodesVisited NOT LIKE CONCAT('%|', #edges.head, '|%')
)
...


But this methodology can be inefficient in itself because there is overhead to doing the wildcard contains search in the WHERE clause on each iteration. This coupled with the NOT LIKE clause also makes the predicate non-sargable which effectively means it can't take use of an index to efficiently serve that predicate. Depending on the size of the #edges data, that could also impact the performance negatively.

A slightly alternative option (assuming you're on SQL Server 2016 or newer) is to build the string of visited nodes, then use CROSS APPLY with the built-in STRING_SPLIT() function to join to the collection of visited nodes as a dataset, as opposed to using a wildcard contains search. This too can be resource intensive though (from the repeated calls to the STRING_SPLIT() function).

Unfortunately, neither of these solutions apply to your problem with your acyclic graph, because your problem isn't with cyclic descendent chains, rather it's siblings with duplicate children (as discussed in the comments).

At the end of the day, you're going to want to look at the execution plan and runtime statistics, as previously mentioned, to determine how performant each implementation is, and compare them.

  • stored procedure that implements recursion programmatically



  • hierarchyid



  • SQL Graph



  • Of the above alternatives, what are the tradeoffs?



The stored procedure route will likely be less efficient than your recursive CTE implementation. It's more of a procedural code approach as opposed to a relational one. I've converted recursive stored procedures that took over 1 hour to compute down to under 1 second by implementing them as recursive CTEs.

The hierarchyid data type I have no experience with, but I would guess you would still need to implement a recursive-based solution to get the end results you're after, either way. I don't believe this data type is commonly used.

Using the SQL Graph feature would be interesting. I only have minimal experience using that as well. But again, I think it's a much less utilized feature of SQL Server, and so there's probably less information out there on how to properly use it efficiently.

  • Are there any other alternatives I should consider?



Not that I can think of at the moment. The most common solution is a recursive CTE, generally.

Code Snippets

WITH descendents(node)
AS(
    SELECT 0 as node, CONVERT(VARCHAR(MAX), '0') AS NodesVisited
    UNION ALL
    SELECT head as node, CONCAT(prior.NodesVisited, '|', head) AS NodesVisited
    FROM descendents as prior
    JOIN #edges ON prior.node = tail
    WHERE prior.NodesVisited NOT LIKE CONCAT('%|', #edges.head, '|%')
)
...

Context

StackExchange Database Administrators Q#327305, answer score: 2

Revisions (0)

No revisions yet.