patternModerate
Calculating the harmonic average
Viewed 0 times
calculatingtheaverageharmonic
Problem
In this program I was required to calculate the Harmonic Average using an assembly program with a C driver.
Harmonic mean is defined as:
$$ \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}}$$
where \$n = \$ number of inputs and \$x_i = \$ a real number.
I would like to know if:
Attached below is my program in Assembly:
Harmonic mean is defined as:
$$ \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}}$$
where \$n = \$ number of inputs and \$x_i = \$ a real number.
I would like to know if:
- My code follows common practices.
- My code is formatted correctly.
- If comments need to be improved.
- My code is straightforward; that is, I have achieved the goal with the least amount of work.
- There any improvements I should make in my code.
Attached below is my program in Assembly:
;Assembly function that computs the harmonic mean
;of an array of 64-bit floating-point numbers.
;Retrieves input using a C program.
;
;Harmonic mean is defined as Sum(n/((1/x1) + (1/x2) + ... + (1/xn)))
;
; expects:
; RDI - address of array
; RSI - length of the array
; returns
; XMMO - the harmonic average of array's values
global harmonicMean
section .data
Zero dd 0.0
One dd 1.0
section .text
harmonicMean:
push rbp
mov rbp, rsp ;C prologue
movss xmm10, [Zero] ;Holds tally of denominator
cvtsi2ss xmm0, rsi ;Take length and put it into xmm0 register
.whileLoop:
cmp rsi, 0 ;Is the length of array 0?
je .endwhile
call addDen ;Compute a denominator value and add it to sum
add rdi, 4 ;Add size of float to address
dec rsi ;Decrease the length
jmp .whileLoop
.endwhile:
divss xmm0, xmm10
leave
ret
;Calculates a number in the denominator
addDen:
push rdi
movss xmm8, [One]
movss xmm9, [rdi]
divss xmm8, xmm9
addss xmm10, xmm8
pop rdi
retSolution
Here are some things that may help you improve your code.
Omit C prolog and epilog if possible
The part you have correctly labeled "C prologue" in your code is used by the compiler to be able to access local variables. However both inputs and outputs to this assembly routine are in registers, so no stack manipulations are needed. For that reason, we can eliminate both in this code.
Don't
In the
Avoid subroutine calls
Subroutine calls are not good for fast assembly code. Your
Reduce jumps per loop iteration
The code as currently structured has both a comparison and conditional jump and also an unconditional
Use direct memory access for instructions
The current code uses this instruction sequence
However, the
This saves us a register (
Avoid compare instructions if practical
The current code uses
Take advantage of SIMD where practical
SIMD stands for Single Instruction, Multiple Data which allows the CPU to execute multiple operations in parallel, speeding up the routine. Because your code calculate reciprocals of each data item in the list, we can do that calculation in parallel (4 at a time using SSE) and save some time. If there aren't 4 data items, we can drop back to a simple serial approach as with what you've already written. Here's one way to write such a routine:
Test your results
With assembly language code like this, as with much code, the two things one generally looks for are correctness and speed.
Correctness
Although the algorithm is essentially identical in all versions, there is subtle difference that is inherent with the addition of floating point numbers. Specifically, with floating point numbers, adding a large number to a small number can lead to loss of precision. In extreme cases, if
Omit C prolog and epilog if possible
The part you have correctly labeled "C prologue" in your code is used by the compiler to be able to access local variables. However both inputs and outputs to this assembly routine are in registers, so no stack manipulations are needed. For that reason, we can eliminate both in this code.
Don't
push/pop registers you don't alterIn the
addDen routine, the rdi register is pushed and then popped off the stack, but not altered by the routine. As a result, neither push nor pop are required and both could be omitted.Avoid subroutine calls
Subroutine calls are not good for fast assembly code. Your
addDen code doesn't need to be a subroutine -- it can be placed inline, avoiding the overhead of a subroutine call.Reduce jumps per loop iteration
The code as currently structured has both a comparison and conditional jump and also an unconditional
jmp .whileLoop that both occur within each loop iteration. Better practice is to minimize the number of jumps within each loop; ideally one should strive to reduce it to exactly one.Use direct memory access for instructions
The current code uses this instruction sequence
movss xmm8, [One]
movss xmm9, [rdi]
divss xmm8, xmm9However, the
divss instruction can use memory directly as a source, so we can shorten this to:movss xmm8, [One]
divss xmm8, [rdi]This saves us a register (
xmm9) and some time.Avoid compare instructions if practical
The current code uses
cmp rsi, 0 but that's not really needed since the dec esi at the end of the loop already sets up the zero flag. Putting all of the previous hints together, a revised routine looks like this:section .data
harmonicMean2:
cvtsi2ss xmm0, rsi ; xmm0 = len = numerator
movss xmm10, [Zero] ; xmm10 = denominator
inc rsi
jmp .testzero
.more:
movss xmm8, [One]
divss xmm8, [rdi]
add rdi, 4
addss xmm10, xmm8
.testzero:
dec rsi
jnz .more
divss xmm0, xmm10
retTake advantage of SIMD where practical
SIMD stands for Single Instruction, Multiple Data which allows the CPU to execute multiple operations in parallel, speeding up the routine. Because your code calculate reciprocals of each data item in the list, we can do that calculation in parallel (4 at a time using SSE) and save some time. If there aren't 4 data items, we can drop back to a simple serial approach as with what you've already written. Here's one way to write such a routine:
section .data
One dd 1.0, 1.0, 1.0, 1.0
section .data
harmonicMeanP:
cvtsi2ss xmm0, rsi ; xmm0 = len = numerator
xorps xmm10, xmm10 ; xmm10 = denominator = 0
mov rax, rsi
shr rax, 2
jz .single
movups xmm9, [One] ; Loads series of 1s in
.doquad:
movaps xmm8, xmm9 ; copy those ones
divps xmm8, [rdi] ; fetch 4 floats
add rdi, 16
addps xmm10, xmm8
dec rax
jnz .doquad
movaps xmm8, xmm10 ; copy xmm10
shufps xmm8, xmm10, $04E ; swap low and high halves
addps xmm8, xmm10 ; sum now in xmm8
movaps xmm10, xmm8 ; copy xmm8
shufps xmm10, xmm8, $0B1 ; swap pairs into xmm10
addps xmm10, xmm8 ; now have 4 copies of sum in xmm10
and rsi, 3 ; less than 4 left
.single:
inc rsi
jmp .testzero
.more:
movss xmm8, [One]
divss xmm8, [rdi]
add rdi, 4
addss xmm10, xmm8
.testzero:
dec rsi
jnz .more
divss xmm0, xmm10
retTest your results
With assembly language code like this, as with much code, the two things one generally looks for are correctness and speed.
Correctness
Although the algorithm is essentially identical in all versions, there is subtle difference that is inherent with the addition of floating point numbers. Specifically, with floating point numbers, adding a large number to a small number can lead to loss of precision. In extreme cases, if
a is very large and b is very small, we can have a + b = a which is obviously incorrect for non-zero values of b. (See the oft-cited and still excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg for a readable technical discussion for why this is so.) This can affect the harmonic mean calculations. For example, if we calculate the harmonic mean of the numbers \$1 \cdots 100000\$, in that order, the correct value is approximately 8271.198621. However, the original code reports 8270.716797 which is off by a bit. The reason is that by the time we get to 100000, we're adding a very small number (1e-5) to a relatively much bigger number of about 12.09. One can uCode Snippets
movss xmm8, [One]
movss xmm9, [rdi]
divss xmm8, xmm9movss xmm8, [One]
divss xmm8, [rdi]section .data
harmonicMean2:
cvtsi2ss xmm0, rsi ; xmm0 = len = numerator
movss xmm10, [Zero] ; xmm10 = denominator
inc rsi
jmp .testzero
.more:
movss xmm8, [One]
divss xmm8, [rdi]
add rdi, 4
addss xmm10, xmm8
.testzero:
dec rsi
jnz .more
divss xmm0, xmm10
retsection .data
One dd 1.0, 1.0, 1.0, 1.0
section .data
harmonicMeanP:
cvtsi2ss xmm0, rsi ; xmm0 = len = numerator
xorps xmm10, xmm10 ; xmm10 = denominator = 0
mov rax, rsi
shr rax, 2
jz .single
movups xmm9, [One] ; Loads series of 1s in
.doquad:
movaps xmm8, xmm9 ; copy those ones
divps xmm8, [rdi] ; fetch 4 floats
add rdi, 16
addps xmm10, xmm8
dec rax
jnz .doquad
movaps xmm8, xmm10 ; copy xmm10
shufps xmm8, xmm10, $04E ; swap low and high halves
addps xmm8, xmm10 ; sum now in xmm8
movaps xmm10, xmm8 ; copy xmm8
shufps xmm10, xmm8, $0B1 ; swap pairs into xmm10
addps xmm10, xmm8 ; now have 4 copies of sum in xmm10
and rsi, 3 ; less than 4 left
.single:
inc rsi
jmp .testzero
.more:
movss xmm8, [One]
divss xmm8, [rdi]
add rdi, 4
addss xmm10, xmm8
.testzero:
dec rsi
jnz .more
divss xmm0, xmm10
ret#include <algorithm>
#include <iostream>
#include <iomanip>
#include "stopwatch.h"
extern "C" float harmonicMean(float *arr, unsigned len);
extern "C" float harmonicMean2(float *arr, unsigned len);
extern "C" float harmonicMeanP(float *arr, unsigned len);
float hmean(float *arr, unsigned len)
{
double sum = 0;
double n = len;
for ( ; len; --len, ++arr)
sum += 1.0/(*arr);
return n/sum;
}
struct MeanTest
{
float (*calc)(float *, unsigned);
std::string name;
};
static MeanTest test[]{
{harmonicMean, "original" },
{harmonicMean2, "tweaked" },
{harmonicMeanP, "parallel" },
{hmean, "C++ double" },
{nullptr, ""},
};
int main()
{
const unsigned arrlen{100000};
const unsigned iterations{1000};
float nums[arrlen];
float hm;
std::iota(nums, nums+arrlen, 1);
Stopwatch<std::chrono::milliseconds> sw;
for (MeanTest *t = test; t->calc != nullptr; ++t) {
sw.start();
for (unsigned i = iterations; i; --i)
hm = t->calc(nums, arrlen);
sw.stop();
std::cout << std::setprecision(10) << hm << ", "
<< t->name << " time = " << sw.elapsed() << " ms\n";
}
}Context
StackExchange Code Review Q#71603, answer score: 17
Revisions (0)
No revisions yet.