patterncModerate

Distance between two n-dimensional points (NASM)

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

codereview c stackoverflow assembly floating-point coordinate-system

pointsdistancetwobetweendimensionalnasm

Problem

I just finished writing a function that computes the distance between two n-dimensional points.

The original one was written in C and it's basically a translation of this formula:

\$\text{dist}(x,y)=\sqrt{\sum_{i^1}^d{(x_i-y_i)^2}}\$

Here's the C code for completeness sake:

void distanceC(float* source, float* destination, int dimensions, float* result){       
    // sqrt(sum((source-destination)^2, dimensions)))
    int i;
    for (i = 0; i < dimensions; i++){
        *result += powf(source[i] - destination[i], 2);
    }
    *result = sqrtf(*result);
}

Finally, here's the NASM version that I wrote (I stripped some useless parts):

```
section .data

; masks used to set exceeding items to 0 when vectors are not multiple of 4

align 16
mask1: dd 0xffffffff, 0x00, 0x00, 0x00
align 16
mask2: dd 0xffffffff, 0xffffffff, 0x00, 0x00
align 16
mask3: dd 0xffffffff, 0xffffffff, 0xffffffff, 0x00

section .text

Source equ 8
Destination equ 12
Dimensions equ 16
Result equ 20

; ------------------------------------------------------------
; Distance between two points in n-dimensional space
; ------------------------------------------------------------
distance:
; ------------------------------------------------------------
; Entering the function
; ------------------------------------------------------------
push ebp
mov ebp, esp
sub esp, 4
push ebx
push esi
push edi

; ------------------------------------------------------------
; Reading parameters from stack
; ------------------------------------------------------------
mov esi, [ebp+Source]
mov edi, [ebp+Destination]
mov ecx, [ebp+Dimensions]
mov eax, [ebp+Result]

; let's compute how many iteration

Solution

Wrong increment

First, there is a bug in your main loop. This line:

add     ebx, 1

should be:

add     ebx, 4

because you are operating on 4 floats at a time. Right now, you are doing array elements [0..3] following by [1..4]. On my computer the program crashed because it had a problem doing an unaligned load.

Simplify divide by 4

This part where you divide ecx by 4:

push    eax                     
    mov     edx, 0                  
    mov     eax, ecx                
    mov     ebx, 4                  
    div     ebx                     
    mov     ecx, eax                ; ecx becomes the counter for the packed iterations
    pop     eax

can be simplified to this:

mov     edx, ecx
    and     edx, 3
    shr     ecx, 2

Explanation: shr is the shift right instruction. When you shift right by 2 it is the same thing as dividing by 4. To get the remainder of something divided by 4, you just need to and that number with 3. C equivalent code:

remainder = length & 3;
    length >>= 2;

Remove one jump from main loop

Your main loop:

.loopP: movaps  xmm1, [esi+ebx*4]
        movaps  xmm2, [edi+ebx*4]
        subps   xmm1, xmm2
        mulps   xmm1, xmm1
        dec     ecx
        haddps  xmm1, xmm1
        haddps  xmm1, xmm1
        addps   xmm0, xmm1
        jz      .loopC
        add     ebx, 4
        jmp     .loopP

.loopC  cmp     edx, 0
        je      .endD

Can be rewritten to remove one of the jumps at the end:

.loopP: movaps  xmm1, [esi+ebx*4]
        movaps  xmm2, [edi+ebx*4]
        subps   xmm1, xmm2
        add     ebx, 4
        mulps   xmm1, xmm1
        dec     ecx
        haddps  xmm1, xmm1
        haddps  xmm1, xmm1
        addps   xmm0, xmm1
        jnz      .loopP

        cmp     edx, 0
        je      .endD

By the way, I also noticed another bug here. When you exit the loop you currently are skipping the add ebx, 4 line. But you need to advance ebx or else the remainder part below won't have the right value for ebx. This works correctly in my modified version because I always add to ebx inside the loop.

Loading the mask

The part where you do a chain of compares to load one of three masks:

.dU1:   cmp     edx, 1          
        jne     .dU2
        movaps  xmm7, [mask1]           
        jmp     .dU

.dU2:   cmp     edx, 2
        jne     .dU3
        movaps  xmm7, [mask2]
        jmp     .dU

.dU3:   movaps  xmm7, [mask3]

can be simplified to this:

.dU1:   shl     edx, 4
        movaps  xmm7, [mask1 + edx - 16]

Explanation: Your masks are located in memory in consecutive order like an array of 128 bit values. So you can load them by index. Note that edx is 1..3 here. The shl instruction is shift left, and shifting left by 4 multiplies edx by 16. Then we load from mask1 + edx - 16. The minus 16 is because edx started at one instead of zero. Equivalent C code is something like this:

uint8_t mask1[48] = { /* values */ };
        mask128 = *(__m128 *) &mask1[(remainder<<4)-16];

Stack push/pop

I'm not sure if this is part of your calling convention, but your prologue and epilogue have extraneous instructions:

push    ebp
    mov     ebp, esp
    sub     esp, 4
    push    ebx
    push    esi
    push    edi

    pop     edi 
    pop     esi
    pop     ebx
    mov     esp, ebp
    pop     ebp
    ret

Could be changed to:

push    ebp
    mov     ebp, esp
    push    ebx
    push    esi
    push    edi

    pop     edi
    pop     esi
    pop     ebx
    pop     ebp
    ret

Furthermore, you are not utilizing eax. If you used eax instead of ebx everywhere in your function, you wouldn't have to save ebx because you wouldn't ever use it. You can load the result pointer into eax at the very end instead of at the beginning.

Comments

In general I liked all your comments. The one thing I thought you could add a comment for was that this double instruction:

haddps  xmm1, xmm1
        haddps  xmm1, xmm1

adds all 4 floats in xmm1 into the first float in xmm1. I wasn't familiar with the haddps instruction and I initially thought you made a typo (I thought a single haddps would do the whole job).

Alignment required

I was able to use the function successfully, but when I modified my main(), the program crashed. The difference was that my modification caused the arrays I passed to distance() to be not aligned to 16 bytes. Since the movaps instruction expects 128 bit alignment, the arrays passed in need to have 128 bit alignment. I think you should switch to the movups instruction which handles unaligned floats.

Code Snippets

add     ebx, 1

add     ebx, 4

push    eax                     
    mov     edx, 0                  
    mov     eax, ecx                
    mov     ebx, 4                  
    div     ebx                     
    mov     ecx, eax                ; ecx becomes the counter for the packed iterations
    pop     eax

mov     edx, ecx
    and     edx, 3
    shr     ecx, 2

remainder = length & 3;
    length >>= 2;

Context

StackExchange Code Review Q#90671, answer score: 12

Revisions (0)

No revisions yet.