
Addition of two IEEE754-32bit single precision floating point numbers in C++

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


codereview cpp stackoverflow c++floating-point bitwise


Problem

I wrote a C++ program that reads a list of numbered triplets from a text file, where each triplet a, b, c satisfies a+b=c (the addition was done using the float data type). I convert a and b from hex to 32-bit binary numbers, extract the sign, exponent and mantissa, and then add them. Finally, I convert the sum back to its hexadecimal representation and compare it to c.

Explanation:

[Note: '0b': binary '0x': hex]

Take the test case 4 be954bb1 c2a2c2e1 c2a3582d.

Now a = 0xbe954bb1 = 0b10111110100101010100101110110001 .

And b = 0xc2a2c2e1 = 0b11000010101000101100001011100001 .

In IEEE 754 the first bit is the sign, which is 1 for both, hence both numbers are negative. The next eight bits are the exponent, i.e. 0b01111101 for a and 0b10000101 for b, which correspond to 125 and 133 in decimal. These exponents have an offset of 127, so the actual exponents are 125-127=-2 and 133-127=6.

The remaining bits are the mantissa, and the actual floating point number is 1.mantissa x 2^exponent, where 1.mantissa is in binary. So our numbers are 1.00101010100101110110001 x 2^-2 and 1.01000101100001011100001 x 2^6.
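The field layout described above can be checked with a few lines of C++. This is a minimal sketch: the names `Fields`, `decode`, `to_float` and `bits_to_float` are made up for illustration, and it assumes the platform's `float` is IEEE 754 binary32.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of a binary32 bit pattern.
struct Fields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t mantissa;  // 23 bits, without the implicit leading 1
};

Fields decode(std::uint32_t bits) {
    return Fields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Value of a normal number: (-1)^sign * 1.mantissa * 2^(exponent - 127).
float to_float(const Fields& f) {
    assert(f.exponent != 0 && f.exponent != 0xFF);  // normal numbers only
    const double significand = 1.0 + std::ldexp(static_cast<double>(f.mantissa), -23);
    const double magnitude = std::ldexp(significand, static_cast<int>(f.exponent) - 127);
    return static_cast<float>(f.sign ? -magnitude : magnitude);
}

// Reinterprets the bit pattern as a float (assumes float is binary32).
float bits_to_float(std::uint32_t bits) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

For a = 0xbe954bb1, `decode` yields sign 1, exponent 125 and mantissa 1395633, and `to_float(decode(a))` compares equal to `bits_to_float(a)`.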

To add them, we make the exponents the same (using the larger one, i.e. 6), hence we have 1.00101010100101110110001 x 2^-2 + 1.01000101100001011100001 x 2^6 = 0.0000000100101010100101110110001 x 2^6 + 1.01000101100001011100001 x 2^6 = ...

Code:

Now I extract the sign, exponent and mantissa. Note that the mantissas above correspond to 0b00101010100101110110001 and 0b01000101100001011100001, which are 1395633 and 2278113 in decimal.

We are working with integers only, so instead of multiplying both by the same factor (2^-23) to convert them to .mantissa, we add 2^23 to both to get 9784241 and 10666721 (which is 0b1mantissa).

Then forget the actual exponents and divide the mantissa with the smaller exponent by 2^(difference of exponents). Hence we get 9784241/(2^8)=38220 (rounded to nearest even) and 10666721.

We can now add them to get `10704941 = 0b1010001101011
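The worked example can be replayed in code. The sketch below is not the program under review: `add_same_sign` is a hypothetical helper that only handles two normal numbers with the same sign. It also differs from the steps above in one respect: it forms the exact sum in 64 bits and rounds once at the end (ties to even), instead of rounding the shifted operand before adding. For this test case both routes give the same result.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: adds two binary32 bit patterns that are both normal and share a sign.
// No handling of NaN, Inf, zero or subnormal inputs.
std::uint32_t add_same_sign(std::uint32_t a, std::uint32_t b) {
    assert((a >> 31) == (b >> 31));
    // Make `a` the operand with the larger (or equal) exponent.
    if (((a >> 23) & 0xFFu) < ((b >> 23) & 0xFFu)) {
        const std::uint32_t tmp = a; a = b; b = tmp;
    }
    const std::uint32_t sign = a >> 31;
    std::uint32_t exponent = (a >> 23) & 0xFFu;
    const unsigned diff = exponent - ((b >> 23) & 0xFFu);
    if (diff > 24) return a;  // b is smaller than half an ulp of a

    // Restore the implicit leading 1 to get 0b1mantissa (24 bits each).
    const std::uint64_t big = (a & 0x7FFFFFu) | 0x800000u;
    const std::uint64_t sml = (b & 0x7FFFFFu) | 0x800000u;

    // Exact sum, in units of the smaller operand's last bit.
    const std::uint64_t sum = (big << diff) + sml;

    // Number of low bits to drop so that 24 significant bits remain.
    unsigned drop = diff;
    if (sum >> (24 + diff)) { ++drop; ++exponent; }  // carry out of the top bit

    std::uint64_t result = sum >> drop;
    const std::uint64_t rest = sum & ((std::uint64_t{1} << drop) - 1);
    const std::uint64_t half = drop ? std::uint64_t{1} << (drop - 1) : 0;
    if (drop && (rest > half || (rest == half && (result & 1)))) ++result;  // ties to even
    if (result >> 24) { result >>= 1; ++exponent; }  // rounding produced a 25th bit

    if (exponent >= 0xFFu) return (sign << 31) | 0x7F800000u;  // overflow to Inf
    return (sign << 31) | (exponent << 23) |
           (static_cast<std::uint32_t>(result) & 0x7FFFFFu);
}
```

With the test case from the question, `add_same_sign(0xbe954bb1, 0xc2a2c2e1)` returns `0xc2a3582d`, whose 24-bit significand is 10704941.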

Solution

I took the liberty of rewriting your code:

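#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>

#include <random>

#define min_float 0x00000000
#define max_float 0xffffffff

// x must be a uint32_t; the outer parentheses keep each macro intact
// when it is used inside a larger expression
#define exponent(x) (((x) << 1) >> 24)
#define mantissa(x) (((x) << 9) >> 9)
#define sign(x) ((x) >> 31)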

uint32_t add(uint32_t x, uint32_t y) {
    uint32_t result_mantissa;
    uint32_t result_exponent;
    uint32_t result_sign;

    uint32_t different_sign = sign(x) ^ sign(y); //boolean but lets not do any type casting

    // catch NaN
    if (!(exponent(x) ^ 0xFF) && mantissa(x)) return x;
    if (!(exponent(y) ^ 0xFF) && mantissa(y)) return y;

    // catch Inf
    if (!(exponent(x) ^ 0xFF) && !(exponent(y) ^ 0xFF)) {
        // both are inf
        if (different_sign)
            // Inf - Inf
            return 0x7F800000 + 1; // NaN
        else
            // both Inf or -Inf
            return x;
    }
    else if (!(exponent(x) ^ 0xFF)) return x;
    else if (!(exponent(y) ^ 0xFF)) return y;

    // both numbers are non-special
    uint32_t exp_difference;
    if (different_sign) {
        exp_difference = exponent(y) + exponent(x);
    }
    else {
        // no need to account for constant BO
        // beware of underflow
        if (exponent(x) > exponent(y)) exp_difference = exponent(x) - exponent(y);
        else exp_difference = exponent(y) - exponent(x);
    }

    bool x_bigger_abs;
    if      (exponent(x) > exponent(y)) x_bigger_abs = true;
    else if (exponent(x) < exponent(y)) x_bigger_abs = false;
    else if (mantissa(x) > mantissa(y)) x_bigger_abs = true;
    else                                x_bigger_abs = false;

    if (!different_sign) {
        //both numbers have same sign (this is a sum)
        result_sign = sign(x);

        if (x_bigger_abs) {
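            result_mantissa = (mantissa(x) << 1) + ((mantissa(y) << 1) >> exp_difference);
            result_exponent = exponent(x);
        }
        else {
            result_mantissa = (mantissa(y) << 1) + ((mantissa(x) << 1) >> exp_difference);
            result_exponent = exponent(y);
        }
        if (result_mantissa << 31) result_mantissa = (result_mantissa >> 1) + 1;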
        else result_mantissa = (result_mantissa >> 1);
    }
    else {
        // this actually is a subtraction

        if (x_bigger_abs) {
            result_sign = sign(x);
            result_exponent = exponent(x);

            // subtract and round to 23 bit 
            // this means making room in our 32bit representation
            result_mantissa = (mantissa(x) << 1) - ((mantissa(y) << 1) >> exp_difference);
        }
        else {
            result_sign = sign(y);
            result_exponent = exponent(y);

            // subtract and round to 23 bit 
            // this means making room in our 32bit representation
            result_mantissa = (mantissa(y) << 1) - ((mantissa(x) << 1) >> exp_difference);
        }

        if (result_mantissa << 31) result_mantissa = ((result_mantissa >> 1) + 1);
        else result_mantissa = (result_mantissa >> 1);

        // normalize mantissa
        uint32_t temp = result_mantissa ...
        ...
        iss >> std::dec >> i;
        iss >> std::hex >> a;
        iss >> std::hex >> b;
        iss >> std::hex >> c;
        iss.clear();
        if (c == add(a, b)) { // the test passes only if every bit matches
            num_passed++;
        }
        else {
            num_failed++;
        }
    }
    std::cout << "Hex test -- compared to file:  Total " << num_passed << " " << "PASSED " << num_failed << " FAILED." << std::endl;
    file.close();
}

The thoughts:

general

You are not handling the special cases of IEEE 754, namely the Inf and NaN values. You will have to catch them in your add function and take the appropriate action: NaN + X = NaN, Inf + -Inf = NaN, ...
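These rules can be spot-checked against the hardware adder. The following is a small sketch under the assumption that `float` is IEEE 754 binary32 and that the compiler is not running in a fast-math mode; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// NaN: exponent field all ones, mantissa non-zero.
bool is_nan_bits(std::uint32_t bits) {
    return ((bits >> 23) & 0xFFu) == 0xFFu && (bits & 0x7FFFFFu) != 0;
}

// +Inf or -Inf: exponent field all ones, mantissa zero.
bool is_inf_bits(std::uint32_t bits) {
    return ((bits >> 23) & 0xFFu) == 0xFFu && (bits & 0x7FFFFFu) == 0;
}

// Reference result taken from the platform's own float addition.
std::uint32_t reference_add(std::uint32_t a, std::uint32_t b) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    float fa, fb;
    std::memcpy(&fa, &a, sizeof fa);
    std::memcpy(&fb, &b, sizeof fb);
    const float sum = fa + fb;
    std::uint32_t out;
    std::memcpy(&out, &sum, sizeof out);
    return out;
}

// Spot checks of the special-case rules; aborts if one does not hold.
void check_special_cases() {
    const std::uint32_t pos_inf = 0x7F800000u, neg_inf = 0xFF800000u;
    const std::uint32_t quiet_nan = 0x7FC00000u, one = 0x3F800000u;
    assert(is_nan_bits(reference_add(pos_inf, neg_inf)));  // Inf + -Inf = NaN
    assert(is_nan_bits(reference_add(quiet_nan, one)));    // NaN + X = NaN
    assert(is_inf_bits(reference_add(pos_inf, one)));      // Inf + X = Inf
    assert(is_inf_bits(reference_add(neg_inf, neg_inf)));  // -Inf + -Inf = -Inf
}
```

A software `add` can be compared against `reference_add` for finite inputs. For NaN results it is safer to compare with `is_nan_bits`, since the exact NaN payload may differ between implementations.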

main()

For whatever reason, your way of reading the file caught me in an infinite loop, so I adjusted the reading.
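One way to read such a file without risking an endless loop is to parse it line by line and check the stream state after each extraction. This is a sketch only: `TestCase` and `parse_line` are hypothetical names, and the line format (serial number followed by three hex words separated by whitespace) is assumed from the test case quoted in the question.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// One test case: serial number plus the hex words a, b and the expected sum c.
struct TestCase {
    unsigned index;
    std::uint32_t a, b, c;
};

// Parses a line such as "4 be954bb1 c2a2c2e1 c2a3582d".
// Returns false if the line does not contain all four fields.
bool parse_line(const std::string& line, TestCase& out) {
    std::istringstream iss(line);
    iss >> std::dec >> out.index >> std::hex >> out.a >> out.b >> out.c;
    return !iss.fail();
}
```

Driving this with `while (std::getline(file, line))` ends the loop as soon as the file is exhausted, because `std::getline` itself reports the failure.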

I changed the long variables to uint32_t. This is significant. C++ does not require any specific data model, but instead gives a minimum range that each type has to cover. So by using long you can end up with a two's-complement (2C) number, a ones'-complement (1C) number, or even a uint64 with a sufficient BO, depending on your compiler (and ultimately your CPU). The tests perform accordingly:

Total 14 PASSED 26 FAILED. (using long == 2C on my machine)
Total 38 PASSED 2  FAILED. (using uint32_t)

The biggest problem comes when bit-shifting (<< or >>) 2C or 1C numbers. With signed numbers the result is rarely useful, but with unsigned numbers the result is a multiplication by 2^N (left shift) or a division by 2^N (right shift). The latter is exploited when shifting the mantissa to do addition / subtraction.
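The difference is easy to see in a few lines. This is a minimal sketch with made-up helper names; the statement about signed shifts reflects the language rules as I understand them (implementation-defined result for negative values before C++20, arithmetic shift since C++20).

```cpp
#include <cassert>
#include <cstdint>

// Top bit of a 32-bit pattern via an unsigned (logical) shift: always 0 or 1.
std::uint32_t top_bit(std::uint32_t bits) { return bits >> 31; }

// Unsigned left shift multiplies by 2^n modulo 2^32.
std::uint32_t times_pow2(std::uint32_t v, unsigned n) {
    assert(n < 32);  // shifting by the full width or more is undefined
    return v << n;
}

// Unsigned right shift divides by 2^n and discards the remainder.
std::uint32_t div_pow2(std::uint32_t v, unsigned n) {
    assert(n < 32);
    return v >> n;
}

// Signed right shift of a negative value: implementation-defined before C++20.
// Typical compilers copy the sign bit, so this yields -1 rather than 1.
std::int32_t signed_top_bit(std::int32_t v) { return v >> 31; }
```

With `std::uint32_t`, `top_bit(0xbe954bb1)` is 1 and `div_pow2(9784241, 8)` is 38219 (truncated, not rounded), which is why the rounding step has to be done separately.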

add

As mentioned by @Quuxplusone you are leaking memory because you never delete sum, a, b.

Same story with signed vs unsigned integers.

The code becomes a lot more readable if you use preprocessor macros for the individual parts of the float:

#define exponent(x) (((x) << 1) >> 24)
#define mantissa(x) (((x) << 9) >> 9)
#define sign(x) ((x) >> 31)

This conveniently resolves the memory leak and removes the need for the get function.
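Function-like macros are plain text substitution, so a macro body such as `(x << 1) >> 24` without outer parentheses changes meaning inside `exponent(x) - exponent(y)`: `-` binds tighter than `>>`. Small `constexpr` functions avoid that class of problem. The sketch below is an alternative, not the code from this answer; `sign_of`, `exponent_of`, `mantissa_of` and `pack` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Function-style field accessors for a binary32 bit pattern.
constexpr std::uint32_t sign_of(std::uint32_t x)     { return x >> 31; }
constexpr std::uint32_t exponent_of(std::uint32_t x) { return (x << 1) >> 24; }
constexpr std::uint32_t mantissa_of(std::uint32_t x) { return (x << 9) >> 9; }

// Each call is evaluated as a unit, so arithmetic on the results is safe.
static_assert(exponent_of(0xc2a2c2e1u) - exponent_of(0xbe954bb1u) == 8, "133 - 125");
static_assert(mantissa_of(0xbe954bb1u) == 1395633u, "low 23 bits of a");
static_assert(sign_of(0xbe954bb1u) == 1u, "a is negative");

// Inverse operation: reassemble a bit pattern from its three fields.
std::uint32_t pack(std::uint32_t sign, std::uint32_t exponent, std::uint32_t mantissa) {
    assert(sign <= 1u && exponent <= 0xFFu && mantissa <= 0x7FFFFFu);
    return (sign << 31) | (exponent << 23) | mantissa;
}
```

Splitting a word with the three accessors and passing the fields to `pack` returns the original word, which makes a convenient self-check.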

shiftAndRound()

Your rounding is not very clean. You know that the mantissa is at most 23 bits and you have access to 32 bits, so just shift to the left by 1 to make room for the 24th bit (on which rounding depends):

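```
...
result = result >> (d - 1);
if (result << 31) result = (result >> 1) + 1; // round up
else result = (result >> 1); // round down
```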

Code Snippets

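#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>

#include <random>

#define min_float 0x00000000
#define max_float 0xffffffff

// x must be a uint32_t; the outer parentheses keep each macro intact
// when it is used inside a larger expression
#define exponent(x) (((x) << 1) >> 24)
#define mantissa(x) (((x) << 9) >> 9)
#define sign(x) ((x) >> 31)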

uint32_t add(uint32_t x, uint32_t y) {
    uint32_t result_mantissa;
    uint32_t result_exponent;
    uint32_t result_sign;

    uint32_t different_sign = sign(x) ^ sign(y); //boolean but lets not do any type casting

    // catch NaN
    if (!(exponent(x) ^ 0xFF) && mantissa(x)) return x;
    if (!(exponent(y) ^ 0xFF) && mantissa(y)) return y;

    // catch Inf
    if (!(exponent(x) ^ 0xFF) && !(exponent(y) ^ 0xFF)) {
        // both are inf
        if (different_sign)
            // Inf - Inf
            return 0x7F800000 + 1; // NaN
        else
            // both Inf or -Inf
            return x;
    }
    else if (!(exponent(x) ^ 0xFF)) return x;
    else if (!(exponent(y) ^ 0xFF)) return y;

    // both numbers are non-special
    uint32_t exp_difference;
    if (different_sign) {
        exp_difference = exponent(y) + exponent(x);
    }
    else {
        // no need to account for constant BO
        // beware of underflow
        if (exponent(x) > exponent(y)) exp_difference = exponent(x) - exponent(y);
        else exp_difference = exponent(y) - exponent(x);
    }


    bool x_bigger_abs;
    if      (exponent(x) > exponent(y)) x_bigger_abs = true;
    else if (exponent(x) < exponent(y)) x_bigger_abs = false;
    else if (mantissa(x) > mantissa(y)) x_bigger_abs = true;
    else                                x_bigger_abs = false;

    if (!different_sign) {
        //both numbers have same sign (this is a sum)
        result_sign = sign(x);

        if (x_bigger_abs) {
            result_mantissa = (mantissa(x) << 1) + ((mantissa(y) << 1) >> exp_difference);
            result_exponent = exponent(x);
        }
        else {
            result_mantissa = (mantissa(y) << 1) + ((mantissa(x) << 1) >> exp_difference);
            result_exponent = exponent(y);
        }
        if (result_mantissa << 31) result_mantissa = (result_mantissa >> 1) + 1;
        else result_mantissa = (result_mantissa >> 1);
    }
    else {
        // this actually is a subtraction

        if (x_bigger_abs) {
            result_sign = sign(x);
            result_exponent = exponent(x);

            // subtract and round to 23 bit 
            // this means making room in our 32bit representation
            result_mantissa = (mantissa(x) << 1) - ((mantissa(y) << 1) >> exp_difference );
        }
        else {
            result_sign = sign(y);
            result_exponent = exponent(y);

            // subtract and round to 23 bit 
            // this means making room in our 32bit representation
            result_mantissa = (mantissa(y) << 1) - ((mantissa(x) << 1) >> exp_difference);
        }

        if (result_mantissa << 31)  result_mantissa = ((result_mantissa >> 1) + 1);
        else result_mantissa = (result_mantissa >> 1);
        ...

Total 14 PASSED 26 FAILED. (using long == 2C on my machine)
Total 38 PASSED 2  FAILED. (using uint32_t)

#define exponent(x) (((x) << 1) >> 24)
#define mantissa(x) (((x) << 9) >> 9)
#define sign(x) ((x) >> 31)

...
result = result >> (d - 1);
if (result << 31) result = (result >> 1) + 1; // round up
else result = (result >> 1); // round down

// calculate with a 24-bit mantissa and round down to 23 bit
result = (big_mantissa << 1) + ((small_mantissa << 1) >> d);
if (result << 31) result = (result >> 1) + 1; // round up
else result = (result >> 1); // round down
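The snippets above keep one extra bit and round half up. The question mentions rounding to nearest even, which also needs to know whether any bits below the extra bit were set. A possible helper is sketched here; `shift_round_nearest_even` is a hypothetical name and not part of the reviewed code.

```cpp
#include <cassert>
#include <cstdint>

// Shifts `value` right by `shift` bits, rounding to nearest with ties to even.
// `shift` must be in the range [0, 31].
std::uint32_t shift_round_nearest_even(std::uint32_t value, unsigned shift) {
    assert(shift < 32);
    if (shift == 0) return value;  // nothing is discarded, nothing to round
    const std::uint32_t quotient = value >> shift;
    const std::uint32_t remainder = value & ((std::uint32_t{1} << shift) - 1);
    const std::uint32_t half = std::uint32_t{1} << (shift - 1);
    // Round up when more than half is discarded, or exactly half and the
    // kept value is odd (so that ties land on an even result).
    const bool round_up = remainder > half || (remainder == half && (quotient & 1u));
    return quotient + (round_up ? 1u : 0u);
}
```

For the example in the question, `shift_round_nearest_even(9784241, 8)` gives 38220. Ties go to the even neighbour: shifting 5 right by one bit gives 2, shifting 7 gives 4.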

Context

StackExchange Code Review Q#139020, answer score: 4
