patterncModerate
Distance between two n-dimensional points (NASM)
Viewed 0 times
pointsdistancetwobetweendimensionalnasm
Problem
I just finished writing a function that computes the distance between two n-dimensional points.
The original one was written in C and it's basically a translation of this formula:
\$\text{dist}(x,y)=\sqrt{\sum_{i^1}^d{(x_i-y_i)^2}}\$
Here's the C code for completeness sake:
Finally, here's the NASM version that I wrote (I stripped some useless parts):
```
section .data
; masks used to set exceeding items to 0 when vectors are not multiple of 4
align 16
mask1: dd 0xffffffff, 0x00, 0x00, 0x00
align 16
mask2: dd 0xffffffff, 0xffffffff, 0x00, 0x00
align 16
mask3: dd 0xffffffff, 0xffffffff, 0xffffffff, 0x00
section .text
Source equ 8
Destination equ 12
Dimensions equ 16
Result equ 20
; ------------------------------------------------------------
; Distance between two points in n-dimensional space
; ------------------------------------------------------------
distance:
; ------------------------------------------------------------
; Entering the function
; ------------------------------------------------------------
push ebp
mov ebp, esp
sub esp, 4
push ebx
push esi
push edi
; ------------------------------------------------------------
; Reading parameters from stack
; ------------------------------------------------------------
mov esi, [ebp+Source]
mov edi, [ebp+Destination]
mov ecx, [ebp+Dimensions]
mov eax, [ebp+Result]
; let's compute how many iteration
The original one was written in C and it's basically a translation of this formula:
\$\text{dist}(x,y)=\sqrt{\sum_{i^1}^d{(x_i-y_i)^2}}\$
Here's the C code for completeness sake:
void distanceC(float* source, float* destination, int dimensions, float* result){
// sqrt(sum((source-destination)^2, dimensions)))
int i;
for (i = 0; i < dimensions; i++){
*result += powf(source[i] - destination[i], 2);
}
*result = sqrtf(*result);
}Finally, here's the NASM version that I wrote (I stripped some useless parts):
```
section .data
; masks used to set exceeding items to 0 when vectors are not multiple of 4
align 16
mask1: dd 0xffffffff, 0x00, 0x00, 0x00
align 16
mask2: dd 0xffffffff, 0xffffffff, 0x00, 0x00
align 16
mask3: dd 0xffffffff, 0xffffffff, 0xffffffff, 0x00
section .text
Source equ 8
Destination equ 12
Dimensions equ 16
Result equ 20
; ------------------------------------------------------------
; Distance between two points in n-dimensional space
; ------------------------------------------------------------
distance:
; ------------------------------------------------------------
; Entering the function
; ------------------------------------------------------------
push ebp
mov ebp, esp
sub esp, 4
push ebx
push esi
push edi
; ------------------------------------------------------------
; Reading parameters from stack
; ------------------------------------------------------------
mov esi, [ebp+Source]
mov edi, [ebp+Destination]
mov ecx, [ebp+Dimensions]
mov eax, [ebp+Result]
; let's compute how many iteration
Solution
Wrong increment
First, there is a bug in your main loop. This line:
should be:
because you are operating on 4 floats at a time. Right now, you are doing array elements [0..3] following by [1..4]. On my computer the program crashed because it had a problem doing an unaligned load.
Simplify divide by 4
This part where you divide ecx by 4:
can be simplified to this:
Explanation:
Remove one jump from main loop
Your main loop:
Can be rewritten to remove one of the jumps at the end:
By the way, I also noticed another bug here. When you exit the loop you currently are skipping the
Loading the mask
The part where you do a chain of compares to load one of three masks:
can be simplified to this:
Explanation: Your masks are located in memory in consecutive order like an array of 128 bit values. So you can load them by index. Note that
Stack push/pop
I'm not sure if this is part of your calling convention, but your prologue and epilogue have extraneous instructions:
Could be changed to:
Furthermore, you are not utilizing
Comments
In general I liked all your comments. The one thing I thought you could add a comment for was that this double instruction:
adds all 4 floats in
Alignment required
I was able to use the function successfully, but when I modified my
First, there is a bug in your main loop. This line:
add ebx, 1should be:
add ebx, 4because you are operating on 4 floats at a time. Right now, you are doing array elements [0..3] following by [1..4]. On my computer the program crashed because it had a problem doing an unaligned load.
Simplify divide by 4
This part where you divide ecx by 4:
push eax
mov edx, 0
mov eax, ecx
mov ebx, 4
div ebx
mov ecx, eax ; ecx becomes the counter for the packed iterations
pop eaxcan be simplified to this:
mov edx, ecx
and edx, 3
shr ecx, 2Explanation:
shr is the shift right instruction. When you shift right by 2 it is the same thing as dividing by 4. To get the remainder of something divided by 4, you just need to and that number with 3. C equivalent code:remainder = length & 3;
length >>= 2;Remove one jump from main loop
Your main loop:
.loopP: movaps xmm1, [esi+ebx*4]
movaps xmm2, [edi+ebx*4]
subps xmm1, xmm2
mulps xmm1, xmm1
dec ecx
haddps xmm1, xmm1
haddps xmm1, xmm1
addps xmm0, xmm1
jz .loopC
add ebx, 4
jmp .loopP
.loopC cmp edx, 0
je .endDCan be rewritten to remove one of the jumps at the end:
.loopP: movaps xmm1, [esi+ebx*4]
movaps xmm2, [edi+ebx*4]
subps xmm1, xmm2
add ebx, 4
mulps xmm1, xmm1
dec ecx
haddps xmm1, xmm1
haddps xmm1, xmm1
addps xmm0, xmm1
jnz .loopP
cmp edx, 0
je .endDBy the way, I also noticed another bug here. When you exit the loop you currently are skipping the
add ebx, 4 line. But you need to advance ebx or else the remainder part below won't have the right value for ebx. This works correctly in my modified version because I always add to ebx inside the loop.Loading the mask
The part where you do a chain of compares to load one of three masks:
.dU1: cmp edx, 1
jne .dU2
movaps xmm7, [mask1]
jmp .dU
.dU2: cmp edx, 2
jne .dU3
movaps xmm7, [mask2]
jmp .dU
.dU3: movaps xmm7, [mask3]can be simplified to this:
.dU1: shl edx, 4
movaps xmm7, [mask1 + edx - 16]Explanation: Your masks are located in memory in consecutive order like an array of 128 bit values. So you can load them by index. Note that
edx is 1..3 here. The shl instruction is shift left, and shifting left by 4 multiplies edx by 16. Then we load from mask1 + edx - 16. The minus 16 is because edx started at one instead of zero. Equivalent C code is something like this:uint8_t mask1[48] = { /* values */ };
mask128 = *(__m128 *) &mask1[(remainder<<4)-16];Stack push/pop
I'm not sure if this is part of your calling convention, but your prologue and epilogue have extraneous instructions:
push ebp
mov ebp, esp
sub esp, 4
push ebx
push esi
push edi
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
retCould be changed to:
push ebp
mov ebp, esp
push ebx
push esi
push edi
pop edi
pop esi
pop ebx
pop ebp
retFurthermore, you are not utilizing
eax. If you used eax instead of ebx everywhere in your function, you wouldn't have to save ebx because you wouldn't ever use it. You can load the result pointer into eax at the very end instead of at the beginning.Comments
In general I liked all your comments. The one thing I thought you could add a comment for was that this double instruction:
haddps xmm1, xmm1
haddps xmm1, xmm1adds all 4 floats in
xmm1 into the first float in xmm1. I wasn't familiar with the haddps instruction and I initially thought you made a typo (I thought a single haddps would do the whole job).Alignment required
I was able to use the function successfully, but when I modified my
main(), the program crashed. The difference was that my modification caused the arrays I passed to distance() to be not aligned to 16 bytes. Since the movaps instruction expects 128 bit alignment, the arrays passed in need to have 128 bit alignment. I think you should switch to the movups instruction which handles unaligned floats.Code Snippets
add ebx, 1add ebx, 4push eax
mov edx, 0
mov eax, ecx
mov ebx, 4
div ebx
mov ecx, eax ; ecx becomes the counter for the packed iterations
pop eaxmov edx, ecx
and edx, 3
shr ecx, 2remainder = length & 3;
length >>= 2;Context
StackExchange Code Review Q#90671, answer score: 12
Revisions (0)
No revisions yet.