patternMinor
Writing SIMD libraries for C++ on FASM in x86-64 Linux
simdfasmlibrarieswritingx86forlinux
Problem
I have recently started developing SIMD libraries for C++, written in FASM for x86-64 Linux.
I would be glad to hear any opinions or feedback about the project, the cleanness of the code, and the documentation. Here is the project's web site on SourceForge.
Below is just a fragment of the code (addition of two vectors), with comments next to the asm directives.
```
;==============================================================================;
; Binary operations ;
;==============================================================================;
macro SCALAR op, x
{
;---[Parameters]---------------------------
array equ rdi ; pointer to array
size equ rsi ; array size (count of elements)
value equ xmm0 ; value to process with
;---[Internal variables]-------------------
temp equ xmm1 ; temp value
if x eq s
bytes = 4 ; array element size (bytes)
else
bytes = 8 ; array element size (bytes)
end if
step = 16 / bytes ; step size (in bytes)
;------------------------------------------
sub size, step ; size -= step
jb .sclr ; if (size < step) then skip vector code
clone_flt value, x ; Duplicating value through the entire register
;---[Vector loop]--------------------------
@@: movup#x temp, [array] ; temp = array[0]
op#p#x temp, value ; do operation to temp value
movup#x [array], temp ; array[0] = temp
add array, 16 ; array++
sub size, step ; size -= step
jae @b ; do while (size >= step)
;------------------------------------------
.sclr: add size, step
```
Solution
```
array equ rdi ; pointer to array
size equ rsi ; array size (count of elements)
value equ xmm0 ; value to process with
```
I don't particularly care for these. It seems like you're trying to make the code look more like a high-level language, but it seems to me that it ends up neither fish nor fowl; it loses readability for those accustomed to assembly language without really gaining much (if anything) for those accustomed to higher-level languages.
```
temp equ xmm1 ; temp value
if x eq s
bytes = 4 ; array element size (bytes)
else
bytes = 8 ; array element size (bytes)
end if
step = 16 / bytes ; step size (in bytes)
```
I think I'd write this something more like:
```
chunk_size = 16 ; bytes per XMM register
if x eq s
element_size = 4 ; bytes per single-precision element
else
element_size = 8 ; bytes per double-precision element
end if
step_size = chunk_size / element_size ; elements per chunk
```
```
;---[Vector loop]--------------------------
```
At least in my opinion, `@@` labels should be reserved for times when the meaning is exceptionally obvious. Preceding an `@@:` with a comment describing its meaning indicates that you'd probably be better off with a normal label in this case.
With those, your code would come out closer to:
```
sub rsi, step_size ; size -= step
jb .sclr ; if (size < step) then skip vector code
clone_flt xmm0, x ; Duplicate value through entire register
vector_loop:
movup#x xmm1, [rdi] ; temp = array[0]
op#p#x xmm1, xmm0 ; do operation to temp value
movup#x [rdi], xmm1 ; array[0] = temp
add rdi, chunk_size ; array++
sub rsi, step_size ; size -= step
jae vector_loop ; do while (size >= step)
;------------------------------------------
.sclr: add rsi, step_size ; size += step
jz .exit ; If no scalar code is required, then exit
scalar_loop:
movs#x xmm1, [rdi] ; temp = array[0]
op#s#x xmm1, xmm0 ; do operation to temp value
movs#x [rdi], xmm1 ; array[0] = temp
add rdi, element_size ; array++
dec rsi ; size--
jnz scalar_loop ; do while (size != 0)
```
The code itself is quite nicely done, especially doing the subtraction before the main loop to avoid a `cmp` before the conditional jump in the main loop. Kudos!
I can only make a couple of possible suggestions about the code itself:
- Issuing some prefetch instructions if/when rsi is greater than a few hundred or so (but maybe it never is for your uses). Prefetching can be a little difficult to get right, and this is a simple linear pattern, so the prefetch hardware may well be perfectly adequate to the job.
- Unrolling a few iterations of the loop. Then again, especially for simple instructions, it may be memory bound already, and unrolling would just make the code bigger (and pollute more code cache) with little or no speed gain to compensate.

Neither of these has any real certainty of improvement, but if you're feeling adventurous some day, they might be worthy of a little experimentation (if you haven't already).
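As a starting point for that kind of experiment, here is a minimal C++ sketch that uses SSE intrinsics rather than FASM, so it is only an analogue of the macro, not a drop-in replacement. It unrolls the main loop to two 16-byte chunks per iteration and keeps a scalar tail. The names `add_scalar_unrolled`, `add_scalar_reference`, and `matches_reference` are hypothetical and not part of the reviewed library, and the sketch assumes an x86-64 target with single-precision elements and addition as the operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86-64 target

// Hypothetical helper (not from the reviewed library): adds `value` to every
// element of `array`, handling two 16-byte chunks per main-loop iteration.
void add_scalar_unrolled(float* array, std::size_t size, float value) {
    assert(array != nullptr || size == 0);
    constexpr std::size_t chunk_size = 16;                         // bytes per XMM register
    constexpr std::size_t step_size = chunk_size / sizeof(float);  // elements per chunk
    const __m128 v = _mm_set1_ps(value);  // broadcast, like clone_flt in the macro

    // Unrolled main loop: two independent load/add/store chains per iteration.
    while (size >= 2 * step_size) {
        const __m128 t0 = _mm_loadu_ps(array);
        const __m128 t1 = _mm_loadu_ps(array + step_size);
        _mm_storeu_ps(array, _mm_add_ps(t0, v));
        _mm_storeu_ps(array + step_size, _mm_add_ps(t1, v));
        array += 2 * step_size;
        size -= 2 * step_size;
    }
    // At most one full chunk can remain after the unrolled loop.
    if (size >= step_size) {
        _mm_storeu_ps(array, _mm_add_ps(_mm_loadu_ps(array), v));
        array += step_size;
        size -= step_size;
    }
    // Scalar tail for the last (size % step_size) elements.
    for (; size != 0; --size, ++array) {
        *array += value;
    }
}

// Plain loop used as a correctness baseline (and as a timing baseline).
void add_scalar_reference(float* array, std::size_t size, float value) {
    for (std::size_t i = 0; i < size; ++i) {
        array[i] += value;
    }
}

// Runs both versions on identical input of length n and compares the results.
bool matches_reference(std::size_t n) {
    std::vector<float> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] = static_cast<float>(i) * 0.5f;
    }
    add_scalar_unrolled(a.data(), n, 1.5f);
    add_scalar_reference(b.data(), n, 1.5f);
    return a == b;
}
```

Checking `matches_reference` for lengths around the chunk boundaries (0, 3, 4, 7, 8, 11) exercises every path. Whether the unrolled loop is actually faster than the plain one would have to be measured on the arrays you care about.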
Code Snippets
```
array equ rdi ; pointer to array
size equ rsi ; array size (count of elements)
value equ xmm0 ; value to process with
```
```
temp equ xmm1 ; temp value
if x eq s
bytes = 4 ; array element size (bytes)
else
bytes = 8 ; array element size (bytes)
end if
step = 16 / bytes ; step size (in bytes)
```
```
chunk_size = 16 ; bytes per XMM register
if x eq s
element_size = 4 ; bytes per single-precision element
else
element_size = 8 ; bytes per double-precision element
end if
step_size = chunk_size / element_size ; elements per chunk
```
```
;---[Vector loop]--------------------------
```
```
sub rsi, step_size ; size -= step
jb .sclr ; if (size < step) then skip vector code
clone_flt xmm0, x ; Duplicate value through entire register
vector_loop:
movup#x xmm1, [rdi] ; temp = array[0]
op#p#x xmm1, xmm0 ; do operation to temp value
movup#x [rdi], xmm1 ; array[0] = temp
add rdi, chunk_size ; array++
sub rsi, step_size ; size -= step
jae vector_loop ; do while (size >= step)
;------------------------------------------
.sclr: add rsi, step_size ; size += step
jz .exit ; If no scalar code is required, then exit
scalar_loop:
movs#x xmm1, [rdi] ; temp = array[0]
op#s#x xmm1, xmm0 ; do operation to temp value
movs#x [rdi], xmm1 ; array[0] = temp
add rdi, element_size ; array++
dec rsi ; size--
jnz scalar_loop ; do while (size != 0)
```
Context
StackExchange Code Review Q#14998, answer score: 5