Writing SIMD libraries for C++ on FASM in x86-64 Linux

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Problem

I recently started a project to develop SIMD libraries for C++, written in FASM for x86-64 Linux.
I would be glad to hear any opinions or feedback about the project, the cleanliness of the code, and the documentation. Here is the project's web site on SourceForge.

This is just a fragment of the code (addition of two vectors), with comments next to the asm directives.

```
;==============================================================================;
; Binary operations ;
;==============================================================================;
macro SCALAR op, x
{
;---[Parameters]---------------------------
array equ rdi ; pointer to array
size equ rsi ; array size (count of elements)
value equ xmm0 ; value to process with
;---[Internal variables]-------------------
temp equ xmm1 ; temp value
if x eq s
bytes = 4 ; array element size (bytes)
else
bytes = 8 ; array element size (bytes)
end if
step = 16 / bytes ; step size (in bytes)
;------------------------------------------
sub size, step ; size -= step
jb .sclr ; if (size < step) then skip vector code
clone_flt value, x ; Duplicating value through the entire register
;---[Vector loop]--------------------------
@@: movup#x temp, [array] ; temp = array[0]
op#p#x temp, value ; do operation to temp value
movup#x [array], temp ; array[0] = temp
add array, 16 ; array++
sub size, step ; size -= step
jae @b ; do while (size >= step)
;------------------------------------------
.sclr: add size, step
```

Solution

```
array   equ     rdi                        ; pointer to array
size    equ     rsi                        ; array size (count of elements)
value   equ     xmm0                       ; value to process with
```

I don't particularly care for these. It seems like you're trying to make the code look more like a high-level language, but to me it ends up neither fish nor fowl: it loses readability for those accustomed to assembly language without really gaining much (if anything) for those accustomed to higher-level languages.

```
temp    equ     xmm1                       ; temp value
if x eq s
bytes   = 4                                ; array element size (bytes)
else
bytes   = 8                                ; array element size (bytes)
end if
step    = 16 / bytes                       ; step size (in bytes)
```

I think I'd write this something more like:

```
chunk_size = 16
if x eq s
element_size = 4
else
element_size = 8
end if
step_size = chunk_size / element_size
```
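To make the arithmetic concrete: step_size is a count of elements per 16-byte register, not a byte count, even though the original comment on step says "in bytes". Since the library is meant to be called from C++, here is a small C++ sketch that checks both cases at compile time. The helper name elements_per_chunk is made up for illustration, and the sketch assumes the usual x86-64 sizes of 4 bytes for float and 8 bytes for double.

```cpp
#include <cstddef>

// Bytes in one XMM register, matching chunk_size in the assembly.
constexpr std::size_t chunk_size = 16;

// Hypothetical helper (not part of the library): elements of type T per chunk.
template <typename T>
constexpr std::size_t elements_per_chunk() {
    return chunk_size / sizeof(T);
}

// "x eq s" case: single precision, element_size = 4.
static_assert(elements_per_chunk<float>() == 4, "four floats per XMM register");
// Otherwise: double precision, element_size = 8.
static_assert(elements_per_chunk<double>() == 2, "two doubles per XMM register");
```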

```
;---[Vector loop]--------------------------
```

At least in my opinion, @@ labels should be reserved for times when the meaning is exceptionally obvious. Preceding an @@: with a comment describing its meaning indicates that you'd probably be better off with a normal label in this case.

With those, your code would come out closer to:

```
        sub     rsi, step_size             ; size -= step
        jb      .sclr                      ; if (size < step) then skip vector code
    clone_flt   xmm0, x                    ; Duplicate value through entire register

.vector_loop:                              ; dot-prefixed so it stays local, like .sclr
        movup#x xmm1, [rdi]                ; temp = array[0]
        op#p#x  xmm1, xmm0                 ; do operation to temp value
        movup#x [rdi], xmm1                ; array[0] = temp
        add     rdi, chunk_size            ; array++
        sub     rsi, step_size             ; size -= step
        jae     .vector_loop               ; do while (size >= step)
;------------------------------------------
.sclr:  add     rsi, step_size             ; size += step
        jz      .exit                      ; If no scalar code is required, then exit
.scalar_loop:
        movs#x  xmm1, [rdi]                ; temp = array[0]
        op#s#x  xmm1, xmm0                 ; do operation to temp value
        movs#x  [rdi], xmm1                ; array[0] = temp
        add     rdi, element_size          ; array++
        dec     rsi                        ; size--
        jnz     .scalar_loop               ; do while (size != 0)
```
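If it helps with testing, the same control flow (whole 16-byte chunks first, then the leftover elements one at a time) can be mirrored in plain C++ and used as a reference to compare the assembly routine's output against. This is only a sketch under stated assumptions: the name add_scalar_ref is hypothetical, it hard-codes the add operation on single-precision elements, and it models each chunk with an inner loop rather than real SIMD instructions.

```cpp
#include <cstddef>

// Hypothetical reference model, not part of the library:
// adds `value` to every element of `array`, chunk by chunk, then element by element.
void add_scalar_ref(float* array, std::size_t size, float value) {
    constexpr std::size_t chunk_size   = 16;             // bytes per XMM register
    constexpr std::size_t element_size = sizeof(float);  // 4 bytes on x86-64
    constexpr std::size_t step_size    = chunk_size / element_size;

    // Vector part: runs only while a whole chunk is left.
    while (size >= step_size) {
        for (std::size_t lane = 0; lane < step_size; ++lane) {
            array[lane] += value;                        // one lane of the packed op
        }
        array += step_size;
        size  -= step_size;
    }

    // Scalar tail: between 0 and step_size - 1 leftover elements.
    while (size != 0) {
        *array += value;
        ++array;
        --size;
    }
}
```

On x86-64, where compilers use SSE for float arithmetic, the packed and scalar additions perform the same single-precision operation per element, so for the add case the assembly routine's results should match this model exactly.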

The code itself is quite nicely done, especially doing subtraction before the main loop to avoid a cmp before the jmp in the main loop. Kudos!

I can only make a couple of possible suggestions about the code itself:

- issuing some prefetch instructions if/when rsi is greater than a few hundred or so (but maybe it never is for your uses). Prefetching can be a little difficult to get right, and this is a simple linear pattern, so the prefetch hardware may well be perfectly adequate to the job.

- unrolling a few iterations of the loop--but then again, especially for simple instructions, it may be memory bound already and unrolling would just make the code bigger (and pollute more code cache) with little or no speed gain to compensate.

Neither of these has any real certainty of improvement, but if you're feeling adventurous some day, they might be worthy of a little experimentation (if you haven't already).
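For the unrolling experiment, the loop shape would be roughly as follows. It is shown with C++ SSE intrinsics rather than FASM so that it can be compiled and timed quickly. The names add_chunk and add_scalar_unrolled are hypothetical, the unroll factor of two is arbitrary, and the non-SSE branch exists only so the sketch also builds on other hosts. Whether this beats the plain loop is exactly the thing to measure.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_set1_ps, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical helper: adds v to the four floats starting at p (one 16-byte chunk).
static inline void add_chunk(float* p, float v) {
#ifdef SKETCH_HAVE_SSE
    const __m128 value = _mm_set1_ps(v);                   // broadcast v to all lanes
    _mm_storeu_ps(p, _mm_add_ps(_mm_loadu_ps(p), value));  // unaligned load, add, store
#else
    for (int lane = 0; lane < 4; ++lane) p[lane] += v;     // portable fallback
#endif
}

// Hypothetical name: array[i] += v for all i, with the vector loop unrolled twice.
void add_scalar_unrolled(float* array, std::size_t size, float v) {
    constexpr std::size_t step_size = 4;                   // floats per 16-byte chunk

    // Unrolled vector loop: two chunks (8 floats) per iteration.
    while (size >= 2 * step_size) {
        add_chunk(array, v);
        add_chunk(array + step_size, v);
        array += 2 * step_size;
        size  -= 2 * step_size;
    }

    // At most one whole chunk can remain after the unrolled loop.
    if (size >= step_size) {
        add_chunk(array, v);
        array += step_size;
        size  -= step_size;
    }

    // Scalar tail: 0 to 3 leftover elements.
    while (size != 0) {
        *array += v;
        ++array;
        --size;
    }
}
```

The extra single-chunk step between the unrolled loop and the scalar tail is the price of unrolling: the tail logic grows along with the loop body, which is part of the code-size cost mentioned above.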

Context

StackExchange Code Review Q#14998, answer score: 5
