HiveBrain v1.2.0

Speeding up OpenCL matrix-vector multiplication

Submitted by: @import:stackexchange-codereview
multiplication, opencl, speeding, matrix, vector

Problem

So I'd like to get a performance boost above and beyond standard Julia matrix-vector multiply, using my Intel HD Graphics 4000 (1536 MB) GPU, but so far my kernel runs about an order of magnitude slower than the CPU.

The kernel I'm using is based on this approach. I don't know how ArrayFire gets such fast speeds; they're clearly using some sort of black magic, but I don't know what it is. Anyway, here is the test code:

using OpenCL
const cl = OpenCL

function mvmulJulia(M::Int32, N::Int32)
    srand(1)
    A = rand(Float32, M, N)
    x = rand(Float32, N)
    t = @elapsed A * x
    println(t, " seconds for Julia")
    nothing
end

function mvmulGPU(M::Int32, N::Int32, P::Int32)
    @assert N % P == 0

    srand(1)
    TPG = div(N, P)
    A = rand(Float32, M, N)
    x = rand(Float32, N)

    device, ctx, queue = cl.create_compute_context()
    ctx = cl.Context(device)
    queue = cl.CmdQueue(ctx, :profile)

    A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
    x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
    y_buff = cl.Buffer(Float32, ctx, :w, M)

    mvmulkernel = """
    kernel void mvmul(int M,
                        int N,
                        int P,
                        int TPG,
                        const global float *A,
                        const global float *x,
                        global float *y)
        {   
            int i = get_global_id(0);
            int j = get_global_id(1);
            int tpg = get_local_id(1);

            local float sums[$(TPG)];
            float sum = 0.0f;

            for (int p=0; p<P; p++)
            {
                sum += A[M * (j - tpg + p) + i] * x[j - tpg + p];
            }

            sums[get_group_id(1)] = sum;
            barrier(CLK_LOCAL_MEM_FENCE);

            if (j == 0)
            {
                float sumtotal = 0.0f;
                for (int q=0; q<TPG; q++)
                {
                    sumtotal += sums[q];
                }

                y[i] = sumtotal;
            }
        }
    """

    program = cl.Program(ctx, source=mvmulkernel) |> cl.build!
    kernel = cl.Kernel(program, "mvmul")
    evt = cl.call(queue, kernel, (M, N), (1, P), M, N, P, TPG, A_buff, x_buff, y_buff)

    t = round(evt[:profile_duration] * 1e-9, 6)
    println(t, " seconds on GPU")
    y = cl.read(queue, y_buff)
    println(isapprox(y, A * x))
    nothing
end

M = Int32(4000)
N = Int32(300)
P = Int32(50)

mvmulJulia(M, N)
mvmulGPU(M, N, P)
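A note on the indexing: the kernel reads the matrix as A[M * j + i] because Julia stores arrays column-major, so element (i, j) sits at linear offset M * j + i in the buffer. Here is a quick sanity check of that layout, sketched in NumPy (not part of the original post; Fortran order mimics Julia's memory layout):

```python
import numpy as np

# Small matrix whose entries are just their creation order
M, N = 4, 3
A = np.arange(M * N, dtype=np.float32).reshape(M, N)

# Column-major (Fortran-order) flattening, as Julia lays out A in memory
flat = A.flatten(order="F")

# flat[M * j + i] must recover element (i, j) for every index pair
for i in range(M):
    for j in range(N):
        assert flat[M * j + i] == A[i, j]
```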



Solution

OK, so I figured out what I was doing wrong. Basically, I was completely misunderstanding how the work-groups and work-items were supposed to be divided up.

What I have in the code in my original post is 1 thread for each element of the matrix A, and then I broke each row of this matrix up into work-groups of size P.

What I was supposed to do instead was have P threads per row, which is a total of (M, P) threads, and then collect each row into a single work-group (of size P). So instead of having N/P work-groups of size P per row, I now only have a single work-group of size P for each row. Hopefully that makes sense to everyone.
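The corrected decomposition can be emulated on the CPU to check the arithmetic. This is a plain Python/NumPy sketch of my own (not OpenCL, and not from the original post), with M, N, P matching the script below: one work-group of P work-items per row, each work-item accumulating every P-th element, then a local reduction.

```python
import numpy as np

M, N, P = 30, 300, 30
rng = np.random.default_rng(1)
A = rng.random((M, N), dtype=np.float32)
x = rng.random(N, dtype=np.float32)

y = np.zeros(M, dtype=np.float32)
for i in range(M):                        # one work-group per row
    sums = np.zeros(P, dtype=np.float32)  # the group's local memory
    for j in range(P):                    # P work-items, emulated serially
        s = np.float32(0.0)
        for q in range(N // P):           # each item handles N/P strided elements
            s += A[i, j + P * q] * x[j + P * q]
        sums[j] = s
    y[i] = sums.sum()                     # work-item 0's final reduction

# The partial sums recombine into the full matrix-vector product
assert np.allclose(y, A @ x, rtol=1e-4)
```

Every column index 0..N-1 is covered exactly once across the P work-items, which is why the partials add up to the exact row dot product.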

So here is the corrected code. I didn't put it into functions this time, so just run the script as is. The script includes both mvmul1 and mvmul2 from that website I linked to in my original post.

using OpenCL
const cl = OpenCL
srand(1)

M = Int32(30)
N = Int32(300)
P = Int32(30)
A = rand(Float32, M, N)
x = rand(Float32, N)

mvmul1 = """
kernel void mvmul1(int M,
                        int N,
                        const global float *A,
                        const global float *x,
                        global float *y)
    {   
        int i = get_global_id(0);
        float acc = 0.0f;

        for (int j=0; j<N; j++)
        {
            acc += A[M * j + i] * x[j];
        }

        y[i] = acc;
    }
"""

mvmul2 = """
kernel void mvmul2(int M,
                        int N,
                        int P,
                        const global float *A,
                        const global float *x,
                        global float *y)
    {
        int i = get_global_id(0);
        int j = get_global_id(1);

        local float sums[$(P)];
        float sum = 0.0f;

        for (int q=0; q<(N / P); q++)
        {
            sum += A[M * (j + P * q) + i] * x[j + P * q];
        }

        sums[j] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);

        if (j == 0)
        {
            float sumtotal = 0.0f;
            for (int p=0; p<P; p++)
            {
                sumtotal += sums[p];
            }

            y[i] = sumtotal;
        }
    }
"""


device, ctx, queue = cl.create_compute_context()
ctx = cl.Context(device)
queue = cl.CmdQueue(ctx, :profile)


"""mvmul1"""

A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)

program = cl.Program(ctx, source=mvmul1) |> cl.build!
kernel = cl.Kernel(program, "mvmul1")
evt = cl.call(queue, kernel, M, nothing, M, N, A_buff, x_buff, y_buff)

tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul1 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))

"""mvmul2"""

A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)

program = cl.Program(ctx, source=mvmul2) |> cl.build!
kernel = cl.Kernel(program, "mvmul2")
evt = cl.call(queue, kernel, (M, P), (1, P), M, N, P, A_buff, x_buff, y_buff)

tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul2 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))


You'll notice that when M = N, mvmul1 does well while mvmul2 doesn't, which is interesting.
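One plausible reading of that observation (my own back-of-the-envelope arithmetic, not something profiled on this GPU): with M = N there are already enough rows to keep the device busy, and each mvmul1 work-item does its N multiply-adds with no synchronization at all, while mvmul2 launches P times as many work-items that each do only N/P multiply-adds and then pay for a barrier plus a serial P-term reduction per row. A quick tally of the work:

```python
# Illustrative operation counts for the two kernels in the M = N case
M, N, P = 300, 300, 30

# mvmul1: M work-items, each doing N multiply-adds, no barriers
mvmul1_items = M
mvmul1_macs_per_item = N

# mvmul2: M * P work-items, each doing N / P multiply-adds,
# plus one barrier and a serial P-term reduction per row
mvmul2_items = M * P
mvmul2_macs_per_item = N // P

# Total multiply-add work is identical; mvmul2 only adds
# synchronization and reduction overhead on top of it
assert mvmul1_items * mvmul1_macs_per_item == \
       mvmul2_items * mvmul2_macs_per_item
```

So the split-and-reduce scheme only pays off when there are too few rows to fill the device on their own, which is exactly the tall-matrix (M >> N) case the original post was tuned for.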

Code Snippets

using OpenCL
const cl = OpenCL
srand(1)

M = Int32(30)
N = Int32(300)
P = Int32(30)
A = rand(Float32, M, N)
x = rand(Float32, N)

mvmul1 = """
kernel void mvmul1(int M,
                        int N,
                        const global float *A,
                        const global float *x,
                        global float *y)
    {   
        int i = get_global_id(0);
        float acc = 0.0f;

        for (int j=0; j<N; j++)
        {
            acc += A[M * j + i] * x[j];
        }

        y[i] = acc;
    }
"""

mvmul2 = """
kernel void mvmul2(int M,
                        int N,
                        int P,
                        const global float *A,
                        const global float *x,
                        global float *y)
    {   
        int i = get_global_id(0);
        int j = get_global_id(1);

        local float sums[$(P)];
        float sum = 0.0f;

        for (int q=0; q<(N / P); q++)
        {
            sum += A[M * (j + P * q) + i] * x[j + P * q];
        }

        sums[j] = sum;
        barrier(CLK_LOCAL_MEM_FENCE);

        if (j == 0)
        {
            float sumtotal = 0.0f;
            for (int p=0; p<P; p++)
            {
                sumtotal += sums[p];
            }

            y[i] = sumtotal;
        }
    }
"""


device, ctx, queue = cl.create_compute_context()
ctx = cl.Context(device)
queue = cl.CmdQueue(ctx, :profile)


"""mvmul1"""

A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)

program = cl.Program(ctx, source=mvmul1) |> cl.build!
kernel = cl.Kernel(program, "mvmul1")
evt = cl.call(queue, kernel, M, nothing, M, N, A_buff, x_buff, y_buff)

tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul1 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))



"""mvmul2"""

A_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=A)
x_buff = cl.Buffer(Float32, ctx, (:r, :copy), hostbuf=x)
y_buff = cl.Buffer(Float32, ctx, :w, M)

program = cl.Program(ctx, source=mvmul2) |> cl.build!
kernel = cl.Kernel(program, "mvmul2")
evt = cl.call(queue, kernel, (M, P), (1, P), M, N, P, A_buff, x_buff, y_buff)

tout = round(evt[:profile_duration] * 1e-9, 6)
yout = cl.read(queue, y_buff)
t = @elapsed y = A * x
println("mvmul2 is ", round(t / tout, 3), " times as fast as Julia. ", isapprox(yout, y))

Context

StackExchange Code Review Q#134917, answer score: 2
