patternpythonMinor
Interview task to efficiently find compact representation of nodes in tree and find big O time
Viewed 0 times
efficientlynodestreetimeinterviewbigfindandrepresentationtask
Problem
I recently had an interview problem where I was asked to find a compact representation of nodes in a tree.
As an example, consider the following tree:
The task was, given a list of node values, find a compact list of node values that completely describes the list given. A node value being in the list implictly means that all it's children are in the list as well and viceversa.
As a few examples, for the tree given:
This was my recursive attempt in python (from memory so may be some mistakes/syntax errors), which would be called with the root of the tree first. I was told:
I said that this would run with complexity \$
As an example, consider the following tree:
A
/ | \
B C D
/ | \ | \
E F G H I
/\
J KThe task was, given a list of node values, find a compact list of node values that completely describes the list given. A node value being in the list implictly means that all it's children are in the list as well and viceversa.
As a few examples, for the tree given:
compact([A]) = [A]
compact([B, C, D, J]) = [A]
compact([J, K]) = [E]
compact([J, K, F, G]) = [B]
compact([E, F, G, H]) = [B, H]
compact([H, I, D]) = [D]This was my recursive attempt in python (from memory so may be some mistakes/syntax errors), which would be called with the root of the tree first. I was told:
- I don't need to check for empty trees
- I can assume that each node has basic tree methods (so can check it's value, if it's a leaf node etc).
- The non-compact list of nodes values is passed in through node_list and I was told that there would be no duplicates in this list and that each value in it would correspond to a node in the tree
def compacted_tree(node, node_list):
compact_list = []
if node.value in node_list:
#if this node is in the list then it is it's own consise representation
return [node.value]
allChildrenInCompactList = True
for child in node.get_children():
compact_list.extend(compacted_tree(child, node_list))
if (allChildrenInCompactList):
#only keep checking this if it's still true
allChildrenInCompactList = child.value in compact_list
if (allChildrenInCompactList and !node.isleaf()):
#all my children are in the list so the compact list should just be me
return [node.value]
return compact_listI said that this would run with complexity \$
Solution
This is, strictly speaking, off-topic for Code Review, because it's a question of algorithm design, not one of implementation. But I'll try to explain how I would approach it.
In order to describe an algorithm and its performance, we need some notation. So:
I'm going to assume that I can test for membership in a set in amortized time \$O(1)\$ and add or remove an element from a set in amortized time \$O(1)\$. This means we need to use Python's
The set \$M\$ might be redundant: namely it might contain nodes \$x\$ and \$y\$ such that \$y\$ is an ancestor of \$x\$ (that is, \$y = P(x)\$ or \$y = P(P(x))\$ or ...). All such \$x\$ need to be removed from \$M\$. For example, the set \$\{B, C, D, J\}\$ is redundant because \$B = P(P(J))\$ and so \$J\$ can be removed.
This can be done in one depth-first traversal over the tree:
This visits each node once and so takes time \$O(n)\$.
Any algorithm for this problem is going to have to find nodes all of whose children are in the set of nodes to be compacted (that is, to find nodes \$x\$ where \$C(x) ⊆ M\$), and update the set of nodes accordingly.
So the first algorithm that springs to mind is the simple one in which we repeatedly search the whole tree for a node \$x\$ matching that condition, stopping when we can't find any more such nodes. Formally:
What's the complexity of this algorithm? Step 1 takes time \$O(n)\$. Step 3 takes \$O(m)\$ (since \$M\$ stays the same or gets smaller as the algorithm progresses) and so does step 4 (for the same reason), so each loop 2–5 over the nodes in the tree takes time \$O(mn)\$, and we run that loop \$O(n)\$ times. The overall complexity is thus \$O(mn^2)\$.
How can we improve this algorithm? It would be nice not to have to restart the loop over the nodes in the tree each time we update \$M\$. The reason we restart the loop is because updating \$M\$ might make a compaction available that we already considered and rejected. So we have to start over from the beginning so that we don't miss any compactions.
But if we looked at the nodes in the right order: that is, if we always looked at the children of \$x\$ before looking at \$x\$, then this couldn't happen. This order of traversing a tree is known as post-order.
So the revised algorithm looks like this:
Here the loop 2–4 takes \$O(mn)\$ as before, but we only execute it once, for an overall complexity of \$O(mn)\$.
The step that is contributing the most to the overall complexity is the subset test \$C(x) ⊆ M\$, which is executed \$O(n)\$ times and costs \$O(m)\$. How could we speed this up? The insight here is that all we need to know is the number of items in \$C(x) ∩ M\$, because when \$|C(x) ∩ M| = |C(x)|\$ then \$C(x) ⊆ M\$. So we can maintain a count for each node, of the number of its children that are in \$M\$, and just compare the counts. This results in the following algorithm:
Here step 1 takes \$O(n)\$, step 2 takes \$O(m)\$, the loop 3–6 is executed \$O(n)\$ times, steps 4 and 6 take \$O(1)\$ each time they are executed, so \$O(n)\$ overall, and step 5 takes \$O(n)\$ overall since each node can be removed from \$M\$ at most once and added to \$M\$ at most once. So the overall complexity is \$O(n)\$.
- Notation
In order to describe an algorithm and its performance, we need some notation. So:
- \$n\$ is the number of nodes in the tree;
- \$M\$ is the set of nodes we want to compact;
- \$m = |M|\$, the initial number of nodes in \$M\$;
- \$C(x)\$ is the set of children of the node \$x\$ in the tree.
- \$P(x)\$ is the parent of the node \$x\$ in the tree.
I'm going to assume that I can test for membership in a set in amortized time \$O(1)\$ and add or remove an element from a set in amortized time \$O(1)\$. This means we need to use Python's
set data structure.- Eliminate redundancies
The set \$M\$ might be redundant: namely it might contain nodes \$x\$ and \$y\$ such that \$y\$ is an ancestor of \$x\$ (that is, \$y = P(x)\$ or \$y = P(P(x))\$ or ...). All such \$x\$ need to be removed from \$M\$. For example, the set \$\{B, C, D, J\}\$ is redundant because \$B = P(P(J))\$ and so \$J\$ can be removed.
This can be done in one depth-first traversal over the tree:
- Initialize \$k\$ to 0.
- On entering a node \$x ∈ M\$: if \$k = 0\$, set \$k = 1\$, otherwise remove \$x\$ from \$M\$.
- On leaving a node \$x ∈ M\$: set \$k = 0\$.
This visits each node once and so takes time \$O(n)\$.
- The naïve algorithm
Any algorithm for this problem is going to have to find nodes all of whose children are in the set of nodes to be compacted (that is, to find nodes \$x\$ where \$C(x) ⊆ M\$), and update the set of nodes accordingly.
So the first algorithm that springs to mind is the simple one in which we repeatedly search the whole tree for a node \$x\$ matching that condition, stopping when we can't find any more such nodes. Formally:
- Eliminate redundancies from \$M\$ as described above.
- For each non-leaf node \$x\$ in the tree:
- If \$C(x) ⊆ M\$ then:
- Update \$M\$ to \$(M \setminus C(x)) ∪ \{x\}\$
- Go to step 2.
What's the complexity of this algorithm? Step 1 takes time \$O(n)\$. Step 3 takes \$O(m)\$ (since \$M\$ stays the same or gets smaller as the algorithm progresses) and so does step 4 (for the same reason), so each loop 2–5 over the nodes in the tree takes time \$O(mn)\$, and we run that loop \$O(n)\$ times. The overall complexity is thus \$O(mn^2)\$.
- Looking at the nodes in the right order
How can we improve this algorithm? It would be nice not to have to restart the loop over the nodes in the tree each time we update \$M\$. The reason we restart the loop is because updating \$M\$ might make a compaction available that we already considered and rejected. So we have to start over from the beginning so that we don't miss any compactions.
But if we looked at the nodes in the right order: that is, if we always looked at the children of \$x\$ before looking at \$x\$, then this couldn't happen. This order of traversing a tree is known as post-order.
So the revised algorithm looks like this:
- Eliminate redundancies from \$M\$.
- For each non-leaf node \$x\$ in post-order:
- If \$C(x) ⊆ M\$ then:
- Update \$M\$ to \$(M \setminus C(x)) ∪ \{x\}\$
Here the loop 2–4 takes \$O(mn)\$ as before, but we only execute it once, for an overall complexity of \$O(mn)\$.
- Speeding up the subset test
The step that is contributing the most to the overall complexity is the subset test \$C(x) ⊆ M\$, which is executed \$O(n)\$ times and costs \$O(m)\$. How could we speed this up? The insight here is that all we need to know is the number of items in \$C(x) ∩ M\$, because when \$|C(x) ∩ M| = |C(x)|\$ then \$C(x) ⊆ M\$. So we can maintain a count for each node, of the number of its children that are in \$M\$, and just compare the counts. This results in the following algorithm:
- Eliminate redundancies from \$M\$.
- Initialize a map \$K\$ from nodes to non-negative integers such that \$K(x) = |C(x) ∩ M|\$, the number of children of \$x\$ that are in the set of nodes to be compacted.
- For each non-leaf node \$x\$ in post-order:
- If \$K(x) = |C(x)|\$ then:
- Update \$M\$ to \$(M \setminus C(x)) ∪ \{x\}\$
- Update \$K(P(x))\$ to \$K(P(x)) + 1\$
Here step 1 takes \$O(n)\$, step 2 takes \$O(m)\$, the loop 3–6 is executed \$O(n)\$ times, steps 4 and 6 take \$O(1)\$ each time they are executed, so \$O(n)\$ overall, and step 5 takes \$O(n)\$ overall since each node can be removed from \$M\$ at most once and added to \$M\$ at most once. So the overall complexity is \$O(n)\$.
Context
StackExchange Code Review Q#61417, answer score: 4
Revisions (0)
No revisions yet.