patternMinor

Computing the Fourier transform of a distribution of electron charge (represented by atomic positions in space)

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

representedpositionsfourierthespaceelectronchargetransformdistributionatomic

Problem

I am computing a nested loop operation using OpenCL (open computing language). My main question is, given the code outlined below, how might I optimize the speed and efficiency of using the GPU, for instance by using local memory calls instead of global ones?

Given some input vector \$V=(v_x, v_y, v_z)\$ (in math terms, the input vector is a point in reciprocal space where we evaluate the Fourier transform), the desired computation involves iterating over a collection of atoms in space (of different species, like Hydrogen, Oxygen, etc.) and using the input vector to compute a function. Each atom is described by a 4-tuple, consisting of 3 spatial coordinates, \$A=(a_x, a_y, a_z)\$, and 1 species identifier, species_id, which ranges from 0 to the number of unique atom species.

The iterations look like this (in pseudocode):

for V in input_vectors:
    ft_magnitude = 0
    for (A, species_id) in atoms:
        ft_magnitude += get_scale_factor(V, species_id) * exp( -i * dot(A,V))

where i in the exponential is the complex number \$i=\sqrt{-1}\$, the term dot(A,V) is the dot product of two vectors \$A \cdot V = a_x \cdot v_x + a_y \cdot v_y + a_z \cdot v_z\$, and get_scale_factor(V,species_id) is a lookup operation which returns the correct, pre-computed scale factor, which changes with each vector V and species_id.

The idea of using a GPU for this problem is to let each GPU worker compute the output for one vector V. I wrote a kernel posted below to be used with OpenCL, after following several examples I saw online.

The inputs to the kernel are the following:

input_vectors which has length 3*n_vectors, 3 per input vector

atoms which has length 4*n_atoms, 4 per atom

scale_factors which has length n_species * n_vectors , 1 per atom species per input vector

outputs which has length 2*n_vectors, 1 complex number per input vector, representing the Fourier transform magnitude

Here is the kernel (note the use of cosine an

Solution

The most immediate optimization is to use coalesced reads by switching to array of structs form to structure of arrays.

Context

StackExchange Code Review Q#159748, answer score: 2

Revisions (0)

No revisions yet.