patternMinor
Computing the Fourier transform of a distribution of electron charge (represented by atomic positions in space)
Viewed 0 times
representedpositionsfourierthespaceelectronchargetransformdistributionatomic
Problem
I am computing a nested loop operation using OpenCL (open computing language). My main question is, given the code outlined below, how might I optimize the speed and efficiency of using the GPU, for instance by using local memory calls instead of global ones?
Given some input vector \$V=(v_x, v_y, v_z)\$ (in math terms, the input vector is a point in reciprocal space where we evaluate the Fourier transform), the desired computation involves iterating over a collection of atoms in space (of different species, like Hydrogen, Oxygen, etc.) and using the input vector to compute a function. Each atom is described by a 4-tuple, consisting of 3 spatial coordinates, \$A=(a_x, a_y, a_z)\$, and 1 species identifier,
The iterations look like this (in pseudocode):
where
The idea of using a GPU for this problem is to let each GPU worker compute the output for one vector
The inputs to the kernel are the following:
Here is the kernel (note the use of cosine an
Given some input vector \$V=(v_x, v_y, v_z)\$ (in math terms, the input vector is a point in reciprocal space where we evaluate the Fourier transform), the desired computation involves iterating over a collection of atoms in space (of different species, like Hydrogen, Oxygen, etc.) and using the input vector to compute a function. Each atom is described by a 4-tuple, consisting of 3 spatial coordinates, \$A=(a_x, a_y, a_z)\$, and 1 species identifier,
species_id, which ranges from 0 to the number of unique atom species.The iterations look like this (in pseudocode):
for V in input_vectors:
ft_magnitude = 0
for (A, species_id) in atoms:
ft_magnitude += get_scale_factor(V, species_id) * exp( -i * dot(A,V))where
i in the exponential is the complex number \$i=\sqrt{-1}\$, the term dot(A,V) is the dot product of two vectors \$A \cdot V = a_x \cdot v_x + a_y \cdot v_y + a_z \cdot v_z\$, and get_scale_factor(V,species_id) is a lookup operation which returns the correct, pre-computed scale factor, which changes with each vector V and species_id.The idea of using a GPU for this problem is to let each GPU worker compute the output for one vector
V. I wrote a kernel posted below to be used with OpenCL, after following several examples I saw online.The inputs to the kernel are the following:
input_vectorswhich has length3*n_vectors, 3 per input vector
atomswhich has length4*n_atoms, 4 per atom
scale_factorswhich has lengthn_species * n_vectors, 1 per atom species per input vector
outputswhich has length2*n_vectors, 1 complex number per input vector, representing the Fourier transform magnitude
Here is the kernel (note the use of cosine an
Solution
The most immediate optimization is to use coalesced reads by switching to array of structs form to structure of arrays.
Context
StackExchange Code Review Q#159748, answer score: 2
Revisions (0)
No revisions yet.