patternsqlModerate
PostgreSQL Recursive Descendant Depth
Viewed 0 times
postgresqlrecursivedepthdescendant
Problem
I need to calculate the depth of a descendant from it's ancestor. When a record has
I do not control the data or the columns. The data and table schema comes from an external source. The table is growing continuously. Right now by about 30k records per day. Any node in the tree can be missing and they will be pulled from an external source at some point. They are usually pulled in
We initially had a code solution to this problem, but now having 5M+ rows, it takes almost 30 minutes to complete.
Example table definition and test data:
Note that
Running a query like this:
```
WITH RECURSIVE descendants(id, customer_id, object_id, parent_id, ancestor_id, depth) AS (
SELECT id, customer_id, object_id, parent_id, ancestor_id, 0
FROM objects
WHERE object_id = parent_id
UNION
SELECT o.id, o.customer_id, o.object_id, o.parent_id, o.ancestor_id, d.depth + 1
FROM objects o
INNER JOIN descendants d ON d.parent_id = o.object_id
WHERE
d.id <> o.id
AND
d.customer_id = o.customer_id
) SELECT * FROM descend
object_id = parent_id = ancestor_id, it is considered a root node (the ancestor). I have been trying to get a WITH RECURSIVE query running with PostgreSQL 9.4.I do not control the data or the columns. The data and table schema comes from an external source. The table is growing continuously. Right now by about 30k records per day. Any node in the tree can be missing and they will be pulled from an external source at some point. They are usually pulled in
created_at DESC order but the data is pulled with asynchronous background jobs.We initially had a code solution to this problem, but now having 5M+ rows, it takes almost 30 minutes to complete.
Example table definition and test data:
CREATE TABLE objects (
id serial NOT NULL PRIMARY KEY,
customer_id integer NOT NULL,
object_id integer NOT NULL,
parent_id integer,
ancestor_id integer,
generation integer NOT NULL DEFAULT 0
);
INSERT INTO objects(id, customer_id , object_id, parent_id, ancestor_id, generation)
VALUES (2, 1, 2, 1, 1, -1), --no parent yet
(3, 2, 3, 3, 3, -1), --root node
(4, 2, 4, 3, 3, -1), --depth 1
(5, 2, 5, 4, 3, -1), --depth 2
(6, 2, 6, 5, 3, -1), --depth 3
(7, 1, 7, 7, 7, -1), --root node
(8, 1, 8, 7, 7, -1), --depth 1
(9, 1, 9, 8, 7, -1); --depth 2Note that
object_id is not unique, but the combination (customer_id, object_id) is unique.Running a query like this:
```
WITH RECURSIVE descendants(id, customer_id, object_id, parent_id, ancestor_id, depth) AS (
SELECT id, customer_id, object_id, parent_id, ancestor_id, 0
FROM objects
WHERE object_id = parent_id
UNION
SELECT o.id, o.customer_id, o.object_id, o.parent_id, o.ancestor_id, d.depth + 1
FROM objects o
INNER JOIN descendants d ON d.parent_id = o.object_id
WHERE
d.id <> o.id
AND
d.customer_id = o.customer_id
) SELECT * FROM descend
Solution
The query you have is basically correct. The only mistake is in the second (recursive) part of the CTE where you have:
It should be the other way around:
You want to join the objects with their parents (that have already been found).
So the query that calculates depth can be written (nothing else changed, only formatting):
For the update, you simply replace the last
Tested on SQLfiddle
Additional comments:
The
The following is an attempt to address these issues: improve the CTE as to consider as few rows as possible and use
It can be used as the 1st update or a subsequent:
Note how the CTE has 3 parts. The first two are the stable parts. The 1st part find the root nodes that haven't been updated before and have still
The 3rd, recursive part, finds all the descendants of the first two parts, as before.
Tested on SQLfiddle-2
INNER JOIN descendants d ON d.parent_id = o.object_idIt should be the other way around:
INNER JOIN descendants d ON d.object_id = o.parent_idYou want to join the objects with their parents (that have already been found).
So the query that calculates depth can be written (nothing else changed, only formatting):
-- calculate generation / depth, no updates
WITH RECURSIVE descendants
(id, customer_id, object_id, parent_id, ancestor_id, depth) AS
AS ( SELECT id, customer_id, object_id, parent_id, ancestor_id, 0
FROM objects
WHERE object_id = parent_id
UNION ALL
SELECT o.id, o.customer_id, o.object_id, o.parent_id, o.ancestor_id, d.depth + 1
FROM objects o
INNER JOIN descendants d ON d.customer_id = o.customer_id
AND d.object_id = o.parent_id
WHERE d.id <> o.id
)
SELECT *
FROM descendants d
ORDER BY id ;For the update, you simply replace the last
SELECT, with the UPDATE, joining the result of the cte, back to the table:-- update nodes
WITH RECURSIVE descendants
-- nothing changes here except
-- ancestor_id and parent_id
-- which can be omitted form the select lists
)
UPDATE objects o
SET generation = d.depth
FROM descendants d
WHERE o.id = d.id
AND o.generation = -1 ; -- skip unnecessary updatesTested on SQLfiddle
Additional comments:
- the
ancestor_idand theparent_idare not needed to be in the select list (ancestor is obvious, parent a bit tricky to figure out why), so you can keep them in theSELECTquery if you want but you can safely remove them from theUPDATE.
- the
(customer_id, object_id)seems like a candidate for aUNIQUEconstraint. If your data comply with this, add such a constraint. The joins performed in the recursive CTE would not make sense if it wasn't unique (a node could have 2 parents otherwise).
- if you add that constraint, the
(customer_id, parent_id)would be a candidate for aFOREIGN KEYconstraint thatREFERENCESthe (unique)(customer_id, object_id). You most probably do not want to add that FK constraint though, since by your description, you are adding new rows and some rows can reference others that haven't been yet added.
- There are certainly problems with the efficiency of the query, if it's going to be performed in a big table. Not in the first run, as almost the whole table will be updated anyway. But the second time, you'll want only new rows (and those that were not touched by the 1st run) to be considered for update. The CTE as it is will have to build a big result.
The
AND o.generation = -1 in the final update will make sure that the rows that were updated in the 1st run will not be updated again but the CTE is still an expensive part. The following is an attempt to address these issues: improve the CTE as to consider as few rows as possible and use
(customer_id, obejct_id) instead of (id) to identify rows (so id is completely removed from the query. It can be used as the 1st update or a subsequent:
WITH RECURSIVE descendants
(customer_id, object_id, depth)
AS ( SELECT customer_id, object_id, 0
FROM objects
WHERE object_id = parent_id
AND generation = -1
UNION ALL
SELECT o.customer_id, o.object_id, p.generation + 1
FROM objects o
JOIN objects p ON p.customer_id = o.customer_id
AND p.object_id = o.parent_id
AND p.generation > -1
WHERE o.generation = -1
UNION ALL
SELECT o.customer_id, o.object_id, d.depth + 1
FROM objects o
INNER JOIN descendants d ON o.customer_id = d.customer_id
AND o.parent_id = d.object_id
WHERE o.parent_id <> o.object_id
AND o.generation = -1
)
UPDATE objects o
SET generation = d.depth
FROM descendants d
WHERE o.customer_id = d.customer_id
AND o.object_id = d.object_id
AND o.generation = -1 -- this is not really neededNote how the CTE has 3 parts. The first two are the stable parts. The 1st part find the root nodes that haven't been updated before and have still
generation=-1 so they must be newly added nodes. The 2nd part finds children (with generation=-1) of parent nodes that have previously been updated.The 3rd, recursive part, finds all the descendants of the first two parts, as before.
Tested on SQLfiddle-2
Code Snippets
INNER JOIN descendants d ON d.parent_id = o.object_idINNER JOIN descendants d ON d.object_id = o.parent_id-- calculate generation / depth, no updates
WITH RECURSIVE descendants
(id, customer_id, object_id, parent_id, ancestor_id, depth) AS
AS ( SELECT id, customer_id, object_id, parent_id, ancestor_id, 0
FROM objects
WHERE object_id = parent_id
UNION ALL
SELECT o.id, o.customer_id, o.object_id, o.parent_id, o.ancestor_id, d.depth + 1
FROM objects o
INNER JOIN descendants d ON d.customer_id = o.customer_id
AND d.object_id = o.parent_id
WHERE d.id <> o.id
)
SELECT *
FROM descendants d
ORDER BY id ;-- update nodes
WITH RECURSIVE descendants
-- nothing changes here except
-- ancestor_id and parent_id
-- which can be omitted form the select lists
)
UPDATE objects o
SET generation = d.depth
FROM descendants d
WHERE o.id = d.id
AND o.generation = -1 ; -- skip unnecessary updatesWITH RECURSIVE descendants
(customer_id, object_id, depth)
AS ( SELECT customer_id, object_id, 0
FROM objects
WHERE object_id = parent_id
AND generation = -1
UNION ALL
SELECT o.customer_id, o.object_id, p.generation + 1
FROM objects o
JOIN objects p ON p.customer_id = o.customer_id
AND p.object_id = o.parent_id
AND p.generation > -1
WHERE o.generation = -1
UNION ALL
SELECT o.customer_id, o.object_id, d.depth + 1
FROM objects o
INNER JOIN descendants d ON o.customer_id = d.customer_id
AND o.parent_id = d.object_id
WHERE o.parent_id <> o.object_id
AND o.generation = -1
)
UPDATE objects o
SET generation = d.depth
FROM descendants d
WHERE o.customer_id = d.customer_id
AND o.object_id = d.object_id
AND o.generation = -1 -- this is not really neededContext
StackExchange Database Administrators Q#126181, answer score: 15
Revisions (0)
No revisions yet.