patternMinor
how does the shap algorithm work in polynomial time?
Viewed 0 times
shapthepolynomialtimealgorithmworkdoeshow
Problem
I'm trying to understand how the shap algorithm calculates in polynomial time an estimation to the feature attribution function that satisfies the shapely value attributes (specifically for tree based models!).
A simplified version of the feature attribution function for the feature $i$ of the model $f$ prediction for the input instance $x$ is: $\phi_i(f, x)=\sum_{S \in N /\{i\}}[f_x(S\bigcup{i})-f_x(S)]$ - i.e. sum the change in the model prediction for every set of features $S$ that does not include the feature $i$ from the same set when $i$ is introduced.
The naive exponential estimation for a tree based model (Algorithm 1 in the paper), estimates $f_x(S)$ by recursively following the decision path for $x$ if the split feature is in $S$, and taking the weighted average of both branches if the split feature is not in $S$ (weights are the coverage of train samples in that branch). We do that for every set $S$ in the powerset of features.
So far so good. But now the paper describes Algorithm 2 - a polynomial method for computing the same values, basically by "running simultaneously for all $2^N$ subsets"! I can't understand that, no matter how much I try (see my notebook).
Can someone enlighten me and explain how is that working? how are we tracking the number of sets and the coverage in every path?
Note that the real algorithm is much more complex than what I'm presenting here, as the real feature attribution function weights each set $S$ proportionally to the number of features in it $|S|$ - but I'm not interested in that complexity, only in the basic simplified setup I've presented here.
Thanks for your efforts, this is something that's eluding me for years
A simplified version of the feature attribution function for the feature $i$ of the model $f$ prediction for the input instance $x$ is: $\phi_i(f, x)=\sum_{S \in N /\{i\}}[f_x(S\bigcup{i})-f_x(S)]$ - i.e. sum the change in the model prediction for every set of features $S$ that does not include the feature $i$ from the same set when $i$ is introduced.
The naive exponential estimation for a tree based model (Algorithm 1 in the paper), estimates $f_x(S)$ by recursively following the decision path for $x$ if the split feature is in $S$, and taking the weighted average of both branches if the split feature is not in $S$ (weights are the coverage of train samples in that branch). We do that for every set $S$ in the powerset of features.
So far so good. But now the paper describes Algorithm 2 - a polynomial method for computing the same values, basically by "running simultaneously for all $2^N$ subsets"! I can't understand that, no matter how much I try (see my notebook).
Can someone enlighten me and explain how is that working? how are we tracking the number of sets and the coverage in every path?
Note that the real algorithm is much more complex than what I'm presenting here, as the real feature attribution function weights each set $S$ proportionally to the number of features in it $|S|$ - but I'm not interested in that complexity, only in the basic simplified setup I've presented here.
Thanks for your efforts, this is something that's eluding me for years
Solution
First, some notation.
A decision tree is a binary tree in which each internal node is labelled by one of $x_1,\ldots,x_n$, and has two outgoing edges labelled $0$ and $1$. Leaves are labelled with real numbers. To evaluate the decision tree on an input $(x_1,\ldots,x_n) \in \{0,1\}^n$, we just follow the tree from the root to a leaf, and output the value on the leaf. From now on, we fix a decision tree $T$, as well as a reference output $y$.
A decision tree is irredundant if no root-to-leaf path contains the same variable twice. We can easily make a tree irredundant in linear time, and so we assume that $T$ is irredundant. (For some reason, the paper doesn't make this assumption.)
Given a subset $S \subseteq [n]$, let $f(S)$ denote the expect output of $T$ when run on a random input $x$ subject to $x_S = y_S$ (i.e., $x_i = y_i$ for all $i \in S$). Alternatively, it is the expected value of the following process. Start at the root and proceed to a leaf. At any given node, if the node queries a variable $i \in S$, go to the branch labelled $y_i$. Otherwise, take a random branch. Thus $f([n])$ is the value of $T$ on $y$, and $f(\emptyset)$ is the value of $T$ on a random input.
For $i \in [n]$, we define the Shapley value $\phi(i)$ to be
$$
\phi(i) = \sum_{i \notin S} \frac{|S|!(n-|S|-1)!}{n!} (f(S \cup \{i\}) - f(S)).
$$
For the actual algorithm, the coefficients won't matter much. Indeed, we will calculate directly
$$
\alpha_+(i,\ell) = \sum_{\substack{i \in S \\ |S|=\ell}} f(S), \\
\alpha_-(i,\ell) = \sum_{\substack{i \notin S \\ |S|=\ell}} f(S),
$$
from which we can determine $\phi(i)$ using the formula. In the same way, we can compute other power indices, such as the Banzhaf index.
Let $v$ be any node in $T$. Define $\alpha_{\pm}(i,\ell,v,s_0,s_1)$ as the same value as above, when run on the decision tree rooted at $v$, given that we have already put $s_1$ nodes inside $S$, and $s_0$ nodes outside $S$. Thus we are interested in $\alpha_-(i,\ell,r,1,0)$ and $\alpha_+(i,\ell,r,0,1)$, where $r$ is the root of $T$.
If $v$ is a leaf then $\alpha_{\pm}(i,\ell,v,s_0,s_1)$ is just the label of $v$ multiplied by the number of possible sets $S$, which is $\binom{n-s_0-s_1}{\ell-s_1}$. Now suppose that $v$ is an internal node labelled with $x_j$ and having children $v_0,v_1$. If $j = i$ then
$$
\alpha_+(i,\ell,v,s_0,s_1) = \alpha_+(i,\ell,v_{y_i},s_0,s_1), \\
\alpha_-(i,\ell,v,s_0,s_1) = \frac{\alpha_-(i,\ell,v_0,s_0,s_1) + \alpha_-(i,\ell,v_1,s_0,s_1)}{2}.
$$
(In fact, it doesn't matter whether we choose $\alpha_+$ or $\alpha_-$ on the right-hand side.) If $j \neq i$ then
$$
\alpha_{\pm}(i,\ell,v,s_0,s_1) = \alpha_{\pm}(i,\ell,v_{y_j},s_0,s_1+1) + \frac{\alpha_{\pm}(i,\ell,v_0,s_0+1,s_1) + \alpha_{\pm}(i,\ell,v_1,s_0+1,s_1)}{2}.
$$
The actual algorithm in the paper is some sort of mild optimization of this idea.
A decision tree is a binary tree in which each internal node is labelled by one of $x_1,\ldots,x_n$, and has two outgoing edges labelled $0$ and $1$. Leaves are labelled with real numbers. To evaluate the decision tree on an input $(x_1,\ldots,x_n) \in \{0,1\}^n$, we just follow the tree from the root to a leaf, and output the value on the leaf. From now on, we fix a decision tree $T$, as well as a reference output $y$.
A decision tree is irredundant if no root-to-leaf path contains the same variable twice. We can easily make a tree irredundant in linear time, and so we assume that $T$ is irredundant. (For some reason, the paper doesn't make this assumption.)
Given a subset $S \subseteq [n]$, let $f(S)$ denote the expect output of $T$ when run on a random input $x$ subject to $x_S = y_S$ (i.e., $x_i = y_i$ for all $i \in S$). Alternatively, it is the expected value of the following process. Start at the root and proceed to a leaf. At any given node, if the node queries a variable $i \in S$, go to the branch labelled $y_i$. Otherwise, take a random branch. Thus $f([n])$ is the value of $T$ on $y$, and $f(\emptyset)$ is the value of $T$ on a random input.
For $i \in [n]$, we define the Shapley value $\phi(i)$ to be
$$
\phi(i) = \sum_{i \notin S} \frac{|S|!(n-|S|-1)!}{n!} (f(S \cup \{i\}) - f(S)).
$$
For the actual algorithm, the coefficients won't matter much. Indeed, we will calculate directly
$$
\alpha_+(i,\ell) = \sum_{\substack{i \in S \\ |S|=\ell}} f(S), \\
\alpha_-(i,\ell) = \sum_{\substack{i \notin S \\ |S|=\ell}} f(S),
$$
from which we can determine $\phi(i)$ using the formula. In the same way, we can compute other power indices, such as the Banzhaf index.
Let $v$ be any node in $T$. Define $\alpha_{\pm}(i,\ell,v,s_0,s_1)$ as the same value as above, when run on the decision tree rooted at $v$, given that we have already put $s_1$ nodes inside $S$, and $s_0$ nodes outside $S$. Thus we are interested in $\alpha_-(i,\ell,r,1,0)$ and $\alpha_+(i,\ell,r,0,1)$, where $r$ is the root of $T$.
If $v$ is a leaf then $\alpha_{\pm}(i,\ell,v,s_0,s_1)$ is just the label of $v$ multiplied by the number of possible sets $S$, which is $\binom{n-s_0-s_1}{\ell-s_1}$. Now suppose that $v$ is an internal node labelled with $x_j$ and having children $v_0,v_1$. If $j = i$ then
$$
\alpha_+(i,\ell,v,s_0,s_1) = \alpha_+(i,\ell,v_{y_i},s_0,s_1), \\
\alpha_-(i,\ell,v,s_0,s_1) = \frac{\alpha_-(i,\ell,v_0,s_0,s_1) + \alpha_-(i,\ell,v_1,s_0,s_1)}{2}.
$$
(In fact, it doesn't matter whether we choose $\alpha_+$ or $\alpha_-$ on the right-hand side.) If $j \neq i$ then
$$
\alpha_{\pm}(i,\ell,v,s_0,s_1) = \alpha_{\pm}(i,\ell,v_{y_j},s_0,s_1+1) + \frac{\alpha_{\pm}(i,\ell,v_0,s_0+1,s_1) + \alpha_{\pm}(i,\ell,v_1,s_0+1,s_1)}{2}.
$$
The actual algorithm in the paper is some sort of mild optimization of this idea.
Context
StackExchange Computer Science Q#133247, answer score: 3
Revisions (0)
No revisions yet.