HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Mean and median distance in unweighted graph

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
graphmeanunweighteddistanceandmedian

Problem

I have a very large graph of ~7 million vertices and ~100 million edges. One dfs run in my current implementation runs in 30 seconds. The graph is an unweighted directed strongly connected graph.

I need to find the mean and median distances of the graph (median and mean distances over all shortest-path distances of the graph). I need the exact mean/median, not an approximation. This problem must be related to all-pairs shortest paths. But the only algorithms I know of for this problem are: Floyd Warshall, Dijkstra starting from each vertex, and Breadth First Search starting from each vertex. However, their running time is $\Omega(V^2)$ for all three of these algorithms, so for such a large graph they are too slow.

Can this problem be solved in linear or linearithmic time? If not, what is the fastest algorithm for this problem, in asymptotic complexity?

Solution

(This is not a full answer.)

The median distance is necessarily an integer [footnote 1]. Therefore, it might be possible to use an approximation algorithm for the median to compute the exact median, if the approximation is sufficiently precise.

Suppose the true median is $\xi$. Then if we can efficiently compute a $1+1/3\xi$-approximation to the median, we can reconstruct the exact median from that.

I suspect it is possible to get a $\tau$-approximation to the median efficiently through random sampling. In particular, consider the following procedure:

-
Repeat $k$ times: in the $i$th iteration, pick a random vertex $v_i$, compute single-source shortest paths from $v_i$ to every vertex, and find the distribution of these distances.

-
Aggregate these distributions and output the median of this aggregated distribution.

One might hope to get an approximation whose quality increases rapidly with $k$, and where if we want a constant approximation factor $\tau$, then a constant $k$ suffices. I have no proof that this is possible, though.

We can consider a simpler situation, where given an integer $x$ we are trying to approximately count the fraction of distances that are at most $x$. If the fraction of such distances is approximately $1/2$ (e.g., because $x$ is approximately the median), then one approach is random sampling: repeat $k$ times where we pick a random pair of vertices $u,v$ and compute the distance from $u$ to $v$ (say using BFS), then count what fraction of those $k$ distances are at most $x$. The additive error is typically about $1/2\sqrt{k}$; it exceeds $c/\sqrt{k}$ with probability exponentially small in $c$ (it's basically the probability of a Gaussian being more than $2c$ standard deviations away from the mean). Thus, we can get a $1+\epsilon$-approximation to this count using $O(1/\epsilon^2)$ iterations. Each iteration takes at most $O(E)$ time, so with $O(E/\epsilon^2)$ time we can get a $1+\epsilon$-approximation to this count. The probability that our approximation is wrong can be made exponentially small; an error probability of $1/2^{100}$ is so small as to be negligible in practice (e.g., it is smaller than the probability of a cosmic ray causing a bitflip error that causes an error in your code), and this will increase the running time by only a constant factor (say, 100, or something like that). In this case it might be sufficient to also take $\epsilon$ to be a small constant.

Intuitively, it feels like if we can get a good approximation to this count, it might be possible to extend this to a good approximation to the median. For instance, if we hypothesize that the median is $x$, we can count the fraction of distances that are $\le x-1$, the fraction that are $\le x$, and the fraction that are $\le x+1$. If $x$ is indeed the median, hopefully the first fraction will be noticeably smaller than $1/2$ and the latter fraction will be noticeably larger than $1/2$. I don't know how to turn this into an algorithm that I can prove will always work on all graphs, but I suspect this might work well enough in practice.

Computing the (exact) mean seems like it might be harder, as the mean isn't necessarily an integer.

Footnote 1: The median is either an integer or halfway between two integers. In the latter case, the rest of the answer carries through if you divide everything by two.

Context

StackExchange Computer Science Q#67063, answer score: 3

Revisions (0)

No revisions yet.