HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Doing matrix multiplication with $\lceil n^3 / \log n \rceil$ processors in $2\log n$ steps by Brent's principle

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
principlemultiplicationwithlogprocessorsstepsdoingbrentrceilmatrix

Problem

On a parallel machine with $n$ processors we can compute the sum (or product, or the result of any associative operation) on $n$ numbers in $\log n$ steps. In the first step combine neighbors to get $n/2$ numbers left, then combine them so that $n/4$ numbers are left and so on until a single number, the result, is left which happens after $\lceil \log n \rceil$ steps. With this is it obvious that we can compute the matrix product of two $n \times n$ matrices $A = (A_{ij}), B = (B_{ij})$ with $n^3$ processors in $\log n + 1$ steps, in a first step compute all the $n^3$ products of the entries $A_{ik} \cdot B_{kj}$ for $i,j,k \in \{1,\ldots, n\}$. Then with $n^2$ processors compute the $n^2$ sums of the $n$ numbers $A_{i1}\cdot B_{1j} + \ldots + A_{ik}\cdot B_{kj}$. Hence we get by with $1 + \log n$ parallel steps.

Now the following argument is from C. Papadimitriou, Computational Complexity, page 361, to show that we get time $2\log n$ using just $n^2 / \log n$ processors (the argument used is called Brent's principle according to the book).


We compute the $n^3$ products not in a single step as before, but rather in $\log n$ ''shifts'' using $\lceil n^3 / \log \rceil$ processors at each shift. We use shifts of the same $\lceil n^3 / \log n \rceil$ processors to compute the first $\log \log n$ parallel addition steps, where more than $\frac{n^3}{\log n}$ processors would be ordinary needed. The total number of parallel steps is now no more than $2\log n$, with $\frac{n^3}{\log n}$ processors [...].

(those arguments could also be found on these slides).

But I get more parallel steps as $2\log n$.

1) First we compute the $n^3$ products $A_{ik}\cdot B_{kj}$ with $\log n$ shifts, hence we need $\log n$ parallel steps for this.

2) If we combine the results as written in the introduction, then in the $k$-th step we have $n/2^k$ numbers to combine. Hence if $2^k < \log n$ we do not have enough processors to do this in a single step, hence we have to ''shift'' agai

Solution

Yes, you are correct while all textbooks and lecture notes you and I have found so far are wrong in claiming "doing matrix multiplication with $\lceil n^3 / \log n \rceil$ processors in at most $2\log n$ steps by Brent's principle"

There is hardly anything valuable to add to your meticulous calculation. Anyway, here is an example that specializes your calculation.
An example

Let $n=16$. We have two $16\times16$ matrices, $(A_{ij})$ and $(B_{ij})$ with $\left\lceil \dfrac{n^3}{\log_2n}\right\rceil=\left\lceil\dfrac{16^3}{\log_2 16}\right\rceil=1024$ processors.

It takes 4 parallel steps to compute all the $n^3=4096$ products of the entries $P_{ijk}=A_{ik} \cdot B_{kj}$ for $i,j,k \in \{1,2,\cdots,16\}$ with $1024$ processors.

How to compute $n^2=256$ sums of the $n=16$ numbers $P_{i1j}+ \ldots + P_{i16j}$ for $i,j\in\{1,2,\cdots,16\}$?

We first compute $256\times 8=2024$ sums of two neighbors $S_{ikj}=P_{i(2k-1)j}+P_{i(2k)j}$ for $i,j\in\{1,2,\cdots,16\}$, $k\in\{1,2,\cdots, 8\}$. It takes 2 parallel steps for 1024 processors, where $2$ is $\dfrac{\dfrac{n^3}2}{\dfrac{n^3}{\log_2n}}=\dfrac{\log_2n}2>1$.

There are $256 \times 4 = 1024$, $256 \times 2 = 512$, $256 \times 1 = 256$ sums of two numbers to calculate, respectively. Each of them will take one parallel step because of the serial dependency. We need $3 =\log_2n-1$ parallel steps.

So we need $4 + 2 + 3 = 9$ parallel steps in total, which is bigger than $\lceil 2\log_2 n\rceil = 8$.
Corrected statement

Multiplication of two $n\times n$ matrices can be done in no more than $3\log_2 n+1$ parallel steps with $\lceil n^3 / \log_2 n \rceil$ processors by Brent's principle.
Brent's principle

Here is one version of Brent's principle that applies to current situation.

(Brent’s principle): An algorithm involving $t$ time steps and performing a total of $m$ operations can be executed by $p$ processors in no more than $t + \lceil(m – t) / p\rceil$ time steps in PRAM.

In particular, we have $m=n^3+ n^2(n-1)$ operations and $t=1+\lceil \log_2n\rceil$ time steps for matrix multiplication. If we have $p=\lceil n^3 / \log_2 n \rceil$ processors, the upper limit of parallel steps given by Brent's principle, $t + \lceil(m – t) / p\rceil$ is $3\lceil\log_2 n\rceil-1$ or $3\lceil\log_2 n\rceil$ or $3\lceil\log_2 n\rceil+1$.

Context

StackExchange Computer Science Q#105137, answer score: 2

Revisions (0)

No revisions yet.