HiveBrain v1.2.0
Get Started
← Back to all entries
patternModerate

Calculating the harmonic average

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
calculatingtheaverageharmonic

Problem

In this program I was required to calculate the Harmonic Average using an assembly program with a C driver.

Harmonic mean is defined as:

$$ \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}}$$

where \$n = \$ number of inputs and \$x_i = \$ a real number.

I would like to know if:

  • My code follows common practices.



  • My code is formatted correctly.



  • If comments need to be improved.



  • My code is straightforward; that is, I have achieved the goal with the least amount of work.



  • There any improvements I should make in my code.



Attached below is my program in Assembly:

;Assembly function that computs the harmonic mean
;of an array of 64-bit floating-point numbers.
;Retrieves input using a C program.
;
;Harmonic mean is defined as Sum(n/((1/x1) + (1/x2) + ... + (1/xn)))
;
;   expects:
;       RDI - address of array
;       RSI - length of the array
;   returns
;       XMMO - the harmonic average of array's values

global harmonicMean

section .data
    Zero dd 0.0
    One dd 1.0

section .text
    harmonicMean:
        push rbp
        mov rbp, rsp              ;C prologue

        movss xmm10, [Zero]       ;Holds tally of denominator
        cvtsi2ss xmm0, rsi        ;Take length and put it into xmm0 register

        .whileLoop:
            cmp rsi, 0            ;Is the length of array 0?
            je .endwhile
            call addDen           ;Compute a denominator value and add it to sum
            add rdi, 4            ;Add size of float to address
            dec rsi               ;Decrease the length
            jmp .whileLoop
        .endwhile:

        divss xmm0, xmm10

        leave
        ret

    ;Calculates a number in the denominator
    addDen:
        push rdi
        movss xmm8, [One]
        movss xmm9, [rdi]
        divss xmm8, xmm9
        addss xmm10, xmm8
        pop rdi
        ret

Solution

Here are some things that may help you improve your code.

Omit C prolog and epilog if possible

The part you have correctly labeled "C prologue" in your code is used by the compiler to be able to access local variables. However both inputs and outputs to this assembly routine are in registers, so no stack manipulations are needed. For that reason, we can eliminate both in this code.

Don't push/pop registers you don't alter

In the addDen routine, the rdi register is pushed and then popped off the stack, but not altered by the routine. As a result, neither push nor pop are required and both could be omitted.

Avoid subroutine calls

Subroutine calls are not good for fast assembly code. Your addDen code doesn't need to be a subroutine -- it can be placed inline, avoiding the overhead of a subroutine call.

Reduce jumps per loop iteration

The code as currently structured has both a comparison and conditional jump and also an unconditional jmp .whileLoop that both occur within each loop iteration. Better practice is to minimize the number of jumps within each loop; ideally one should strive to reduce it to exactly one.

Use direct memory access for instructions

The current code uses this instruction sequence

movss xmm8, [One]
    movss xmm9, [rdi]
    divss xmm8, xmm9


However, the divss instruction can use memory directly as a source, so we can shorten this to:

movss xmm8, [One]
    divss xmm8, [rdi]


This saves us a register (xmm9) and some time.

Avoid compare instructions if practical

The current code uses cmp rsi, 0 but that's not really needed since the dec esi at the end of the loop already sets up the zero flag. Putting all of the previous hints together, a revised routine looks like this:

section .data
    harmonicMean2:
        cvtsi2ss xmm0, rsi        ; xmm0 = len = numerator
        movss xmm10, [Zero]       ; xmm10 = denominator
        inc rsi
        jmp .testzero
    .more:
        movss xmm8, [One]
        divss xmm8, [rdi]
        add rdi, 4
        addss xmm10, xmm8
    .testzero:
        dec rsi
        jnz .more
        divss xmm0, xmm10
        ret


Take advantage of SIMD where practical

SIMD stands for Single Instruction, Multiple Data which allows the CPU to execute multiple operations in parallel, speeding up the routine. Because your code calculate reciprocals of each data item in the list, we can do that calculation in parallel (4 at a time using SSE) and save some time. If there aren't 4 data items, we can drop back to a simple serial approach as with what you've already written. Here's one way to write such a routine:

section .data
    One dd 1.0, 1.0, 1.0, 1.0

section .data
    harmonicMeanP:
        cvtsi2ss xmm0, rsi      ; xmm0 = len = numerator
        xorps xmm10, xmm10      ; xmm10 = denominator = 0
        mov rax, rsi
        shr rax, 2
        jz .single
        movups  xmm9, [One]     ; Loads series of 1s in 
    .doquad:
        movaps  xmm8, xmm9      ; copy those ones
        divps   xmm8, [rdi]     ; fetch 4 floats
        add rdi, 16
        addps   xmm10, xmm8

        dec rax
        jnz .doquad
        movaps  xmm8, xmm10         ; copy xmm10
        shufps  xmm8, xmm10, $04E   ; swap low and high halves
        addps   xmm8, xmm10         ; sum now in xmm8
        movaps  xmm10, xmm8         ; copy xmm8
        shufps  xmm10, xmm8, $0B1   ; swap pairs into xmm10
        addps   xmm10, xmm8         ; now have 4 copies of sum in xmm10
        and rsi, 3                  ; less than 4 left
    .single:
        inc rsi
        jmp .testzero
    .more:
        movss xmm8, [One]
        divss xmm8, [rdi]
        add rdi, 4
        addss xmm10, xmm8
    .testzero:
        dec rsi
        jnz .more
        divss xmm0, xmm10
        ret


Test your results

With assembly language code like this, as with much code, the two things one generally looks for are correctness and speed.

Correctness

Although the algorithm is essentially identical in all versions, there is subtle difference that is inherent with the addition of floating point numbers. Specifically, with floating point numbers, adding a large number to a small number can lead to loss of precision. In extreme cases, if a is very large and b is very small, we can have a + b = a which is obviously incorrect for non-zero values of b. (See the oft-cited and still excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg for a readable technical discussion for why this is so.) This can affect the harmonic mean calculations. For example, if we calculate the harmonic mean of the numbers \$1 \cdots 100000\$, in that order, the correct value is approximately 8271.198621. However, the original code reports 8270.716797 which is off by a bit. The reason is that by the time we get to 100000, we're adding a very small number (1e-5) to a relatively much bigger number of about 12.09. One can u

Code Snippets

movss xmm8, [One]
    movss xmm9, [rdi]
    divss xmm8, xmm9
movss xmm8, [One]
    divss xmm8, [rdi]
section .data
    harmonicMean2:
        cvtsi2ss xmm0, rsi        ; xmm0 = len = numerator
        movss xmm10, [Zero]       ; xmm10 = denominator
        inc rsi
        jmp .testzero
    .more:
        movss xmm8, [One]
        divss xmm8, [rdi]
        add rdi, 4
        addss xmm10, xmm8
    .testzero:
        dec rsi
        jnz .more
        divss xmm0, xmm10
        ret
section .data
    One dd 1.0, 1.0, 1.0, 1.0

section .data
    harmonicMeanP:
        cvtsi2ss xmm0, rsi      ; xmm0 = len = numerator
        xorps xmm10, xmm10      ; xmm10 = denominator = 0
        mov rax, rsi
        shr rax, 2
        jz .single
        movups  xmm9, [One]     ; Loads series of 1s in 
    .doquad:
        movaps  xmm8, xmm9      ; copy those ones
        divps   xmm8, [rdi]     ; fetch 4 floats
        add rdi, 16
        addps   xmm10, xmm8

        dec rax
        jnz .doquad
        movaps  xmm8, xmm10         ; copy xmm10
        shufps  xmm8, xmm10, $04E   ; swap low and high halves
        addps   xmm8, xmm10         ; sum now in xmm8
        movaps  xmm10, xmm8         ; copy xmm8
        shufps  xmm10, xmm8, $0B1   ; swap pairs into xmm10
        addps   xmm10, xmm8         ; now have 4 copies of sum in xmm10
        and rsi, 3                  ; less than 4 left
    .single:
        inc rsi
        jmp .testzero
    .more:
        movss xmm8, [One]
        divss xmm8, [rdi]
        add rdi, 4
        addss xmm10, xmm8
    .testzero:
        dec rsi
        jnz .more
        divss xmm0, xmm10
        ret
#include <algorithm>
    #include <iostream>
    #include <iomanip>
    #include "stopwatch.h"

    extern "C" float harmonicMean(float *arr, unsigned len);
    extern "C" float harmonicMean2(float *arr, unsigned len);
    extern "C" float harmonicMeanP(float *arr, unsigned len);
    float hmean(float *arr, unsigned len)
    {
        double sum = 0;
        double n = len;
        for ( ; len; --len, ++arr)
            sum += 1.0/(*arr);
        return n/sum;
    }

    struct MeanTest
    {
        float (*calc)(float *, unsigned);
        std::string name;
    };

    static MeanTest test[]{ 
        {harmonicMean, "original" },
        {harmonicMean2, "tweaked" },
        {harmonicMeanP, "parallel" },
        {hmean, "C++ double" },
        {nullptr, ""},
    };

    int main()
    {
        const unsigned arrlen{100000};
        const unsigned iterations{1000};
        float nums[arrlen];
        float hm;

        std::iota(nums, nums+arrlen, 1);

        Stopwatch<std::chrono::milliseconds> sw;
        for (MeanTest *t = test; t->calc != nullptr; ++t) {
            sw.start();
            for (unsigned i = iterations; i; --i)
                hm = t->calc(nums, arrlen);
            sw.stop();
            std::cout << std::setprecision(10) << hm << ", " 
                << t->name << " time = " << sw.elapsed() << " ms\n";
        }
    }

Context

StackExchange Code Review Q#71603, answer score: 17

Revisions (0)

No revisions yet.