patterncppMinor
Addition of two IEEE754-32bit single precision floating point numbers in C++
32bit ieee754 point floating numbers addition precision two single
Problem
I wrote a C++ program that reads a list of numbered triplets from a text file, where each triplet a, b, c satisfies a + b = c (the addition was done with the float data type). I convert a and b from hex to 32-bit binary numbers, extract the sign, exponent and mantissa, and then add them. Finally I convert the sum back to its hexadecimal representation and compare it to c.
Explanation:
[Note: '0b': binary '0x': hex]
Take the test case
`4 be954bb1 c2a2c2e1 c2a3582d`.
- Now `a = 0xbe954bb1 = 0b10111110100101010100101110110001`.
- And `b = 0xc2a2c2e1 = 0b11000010101000101100001011100001`.
- In IEEE754 the first bit is the sign, which is 1 for both, hence both are negative. The next eight bits are the exponent, i.e. `0b01111101` for a and `0b10000101` for b, which correspond to `125` and `133` in decimal. These exponents have an offset of 127, so the actual exponents are `125-127=-2` and `133-127=6`.
- The remaining bits are the mantissa, and the actual floating point number is `1.mantissa x 2^exponent`, where `1.mantissa` is in binary. So our numbers are `1.00101010100101110110001 x 2^-2` and `1.01000101100001011100001 x 2^6`.
- For adding we make the exponents the same (the larger one, i.e. 6), hence we have
`1.00101010100101110110001 x 2^-2 + 1.01000101100001011100001 x 2^6 = 0.0000000100101010100101110110001 x 2^6 + 1.01000101100001011100001 x 2^6 = ...`
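The field extraction described in the list above can be sketched with masks and shifts on an unsigned 32-bit integer. This is a minimal sketch, not the asker's code: the `Fields` struct and the `decode` and `unbiased_exponent` names are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper type for this sketch: the three raw fields of a binary32 value.
struct Fields {
    uint32_t sign;      // 1 bit
    uint32_t exponent;  // 8 bits, still biased by 127
    uint32_t mantissa;  // 23 bits, without the implicit leading 1
};

// Split a 32-bit pattern using an unsigned type, so every shift is well defined.
Fields decode(uint32_t bits) {
    return Fields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Unbiased exponent of a normal number (exponent field between 1 and 254).
int unbiased_exponent(const Fields& f) {
    return static_cast<int>(f.exponent) - 127;
}
```

For the test case, `decode(0xbe954bb1)` gives sign 1 and exponent field 125 (unbiased -2), and `decode(0xc2a2c2e1)` gives sign 1 and exponent field 133 (unbiased 6), matching the values worked out by hand.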
Code:
- Now I extracted the sign, exponent and mantissa. Note that the above mantissas correspond to `0b00101010100101110110001` and `0b01000101100001011100001`, which are `1395633` and `2278113` in decimal.
- We are working with integers only, so instead of multiplying both by the same factor `2^-23` to convert them to `.mantissa`, we add `2^23` to both to get `9784241` and `10666721` (which is `0b1mantissa`).
- Just forget the actual exponents and divide the mantissa of the smaller exponent by `2^(difference of exponents)`. Hence we get `9784241/(2^8)=38220` (round to nearest even) and `10666721`.
- We can now add them to get `10704941 = 0b101000110101100000101101`
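The integer-only steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the asker's program: it only handles two normal numbers with the same sign, the name `add_same_sign_normals` is hypothetical, and it differs from the walk-through in one respect: it keeps the shifted-out bits in a 64-bit integer and rounds once after the addition, rather than rounding the smaller operand before adding.

```cpp
#include <cstdint>

// Sketch only: adds two NORMAL binary32 values that share the same sign.
// Zeros, subnormals, Inf, NaN and exponent overflow are deliberately ignored.
uint32_t add_same_sign_normals(uint32_t a, uint32_t b) {
    // Let `a` be the operand with the larger magnitude (and so the larger exponent).
    if ((a & 0x7FFFFFFFu) < (b & 0x7FFFFFFFu)) { const uint32_t t = a; a = b; b = t; }

    const uint32_t sign = a & 0x80000000u;
    uint32_t exp = (a >> 23) & 0xFFu;
    const uint32_t diff = exp - ((b >> 23) & 0xFFu);
    if (diff > 26) return a;  // b is far below half an ulp of a

    // 24-bit significands (mantissa + 2^23), scaled by 2^26 so the bits of b
    // that are shifted out during alignment are kept exactly.
    const uint64_t sig_a = static_cast<uint64_t>((a & 0x7FFFFFu) | 0x800000u) << 26;
    const uint64_t sig_b =
        (static_cast<uint64_t>((b & 0x7FFFFFu) | 0x800000u) << 26) >> diff;
    const uint64_t sum = sig_a + sig_b;

    // A carry into bit 50 means the sum needs one more bit: bump the exponent.
    unsigned shift = 26;
    if (sum >> 50) { shift = 27; ++exp; }

    // Round to nearest, ties to even, once, on the exact sum.
    uint64_t q = sum >> shift;
    const uint64_t rem = sum & ((uint64_t{1} << shift) - 1);
    const uint64_t half = uint64_t{1} << (shift - 1);
    if (rem > half || (rem == half && (q & 1))) ++q;

    // q still contains its leading 1 at bit 23, so start from exp - 1.
    return sign | static_cast<uint32_t>((static_cast<uint64_t>(exp - 1) << 23) + q);
}
```

For the test case, `add_same_sign_normals(0xbe954bb1, 0xc2a2c2e1)` yields `0xc2a3582d`, whose 24-bit significand is the `10704941` obtained by hand.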
Solution
I took the liberty of rewriting your code:
#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>
#include <random>

#define min_float 0x00000000
#define max_float 0xffffffff
// fully parenthesized so the macros can be used safely inside larger expressions
#define exponent(x) (((x) << 1) >> 24)
#define mantissa(x) (((x) << 9) >> 9)
#define sign(x) ((x) >> 31)

uint32_t add(uint32_t x, uint32_t y) {
    uint32_t result_mantissa;
    uint32_t result_exponent;
    uint32_t result_sign;
    uint32_t different_sign = sign(x) ^ sign(y); // boolean, but let's not do any type casting

    // catch NaN
    if (!(exponent(x) ^ 0xFF) && mantissa(x)) return x;
    if (!(exponent(y) ^ 0xFF) && mantissa(y)) return y;

    // catch Inf
    if (!(exponent(x) ^ 0xFF) && !(exponent(y) ^ 0xFF)) {
        // both are Inf
        if (different_sign)
            // Inf - Inf
            return 0x7F800000 + 1; // NaN
        else
            // both Inf or -Inf
            return x;
    }
    else if (!(exponent(x) ^ 0xFF)) return x;
    else if (!(exponent(y) ^ 0xFF)) return y;

    // both numbers are non-special
    uint32_t exp_difference;
    if (different_sign) {
        exp_difference = exponent(y) + exponent(x);
    }
    else {
        // no need to account for constant BO
        // beware of underflow
        if (exponent(x) > exponent(y)) exp_difference = exponent(x) - exponent(y);
        else exp_difference = exponent(y) - exponent(x);
    }

    bool x_bigger_abs;
    if (exponent(x) > exponent(y)) x_bigger_abs = true;
    else if (exponent(x) < exponent(y)) x_bigger_abs = false;
    else if (mantissa(x) > mantissa(y)) x_bigger_abs = true;
    else x_bigger_abs = false;

    if (!different_sign) {
        // both numbers have the same sign (this is a sum)
        result_sign = sign(x);
        if (x_bigger_abs) {
            result_mantissa = (mantissa(x) << 1) + ((mantissa(y) << 1) >> exp_difference);
            result_exponent = exponent(x);
        }
        else {
            result_mantissa = (mantissa(y) << 1) + ((mantissa(x) << 1) >> exp_difference);
            result_exponent = exponent(y);
        }
        if (result_mantissa << 31) result_mantissa = (result_mantissa >> 1) + 1;
        else result_mantissa = (result_mantissa >> 1);
    }
    else {
        // this actually is a subtraction
        if (x_bigger_abs) {
            result_sign = sign(x);
            result_exponent = exponent(x);
            // subtract and round to 23 bit
            // this means making room in our 32bit representation
            result_mantissa = (mantissa(x) << 1) - ((mantissa(y) << 1) >> exp_difference);
        }
        else {
            result_sign = sign(y);
            result_exponent = exponent(y);
            // subtract and round to 23 bit
            // this means making room in our 32bit representation
            result_mantissa = (mantissa(y) << 1) - ((mantissa(x) << 1) >> exp_difference);
        }
        if (result_mantissa << 31) result_mantissa = ((result_mantissa >> 1) + 1);
        else result_mantissa = (result_mantissa >> 1);

        // normalize mantissa
        // ...
        iss >> std::dec >> i;
        iss >> std::hex >> a;
        iss >> std::hex >> b;
        iss >> std::hex >> c;
        iss.clear();
        if (c == add(a, b)) {
            num_passed++;
        }
        else {
            num_failed++;
        }
    }
    std::cout << "Hex test -- compared to file: Total " << num_passed << " " << "PASSED " << num_failed << " FAILED." << std::endl;
    file.close();
}
The thoughts:
general
You are not handling the special cases of IEEE754, which are the Inf and NaN values. You will have to catch them in your `add` function and perform the appropriate actions: NaN + X = NaN, Inf + -Inf = NaN, ...
main()
For whatever reason, your way of reading the file caught me in an infinite loop, so I adjusted the reading.
I changed the `long` type variables to `uint32_t`. This is significant. C++ does not require any specific data model, but instead gives a set range in which a number has to be. So by using `long` you can end up getting a two's-complement (2C) number, a ones'-complement (1C) number, or even a uint64 with a sufficient BO, depending on your compiler (and ultimately your CPU). The tests perform accordingly:
Total 14 PASSED 26 FAILED. (using long == 2C on my machine)
Total 38 PASSED 2 FAILED. (using uint32_t)
The biggest problem comes when bit-shifting (`<<`, `>>`) 2C or 1C numbers. With signed numbers the result is rarely useful, but with unsigned numbers the result is a multiplication (`<<`) or division (`>>`) by 2^N. The latter is exploited when shifting the mantissa to do addition / subtraction.
add
As mentioned by @Quuxplusone, you are leaking memory because you never delete `sum`, `a`, `b`.
Same story with signed vs unsigned integers.
The code becomes a lot more readable if you use preprocessor macros for the individual parts of the float:
#define exponent(x) (((x) << 1) >> 24)
#define mantissa(x) (((x) << 9) >> 9)
#define sign(x) ((x) >> 31)
This conveniently resolves the memory leak and removes the need for the `get` function.
shiftAndRound()
Your rounding is not very clean. You know that the mantissa is at most 23 bits and you have access to 32 bits, so just shift to the left by 1 to make room for the 24th bit (on which rounding depends):
```
...
result = result >> (d - 1);
if (result << 31) result = (result >> 1) + 1; // round up
else result = (result >> 1); // round down
```
```
// calculate with a 24-bit mantissa and round down to 23 bit
result = (big_mantissa << 1) + ((small_mantissa << 1) >> d);
if (result << 31) result = (result >> 1) + 1; // round up
else result = (result >> 1); // round down
```
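The review points above (unsigned arithmetic, explicit Inf/NaN handling, and extra low bits kept for rounding) can be combined in one compact sketch. This is not the code from the answer: `soft_add` and `shift_right_sticky` are hypothetical names, three extra bits (guard, round, sticky) are used instead of the single extra bit suggested above, and round-to-nearest-even is assumed as the rounding mode.

```cpp
#include <cstdint>

// Hypothetical helper: shift right by n, OR-ing any shifted-out 1 bits into bit 0
// (a "sticky" bit) so later rounding still knows the discarded part was non-zero.
static uint32_t shift_right_sticky(uint32_t v, uint32_t n) {
    if (n == 0) return v;
    if (n >= 32) return v != 0 ? 1u : 0u;
    const uint32_t lost = v & ((1u << n) - 1u);
    return (v >> n) | (lost != 0 ? 1u : 0u);
}

// Sketch of binary32 addition on raw bit patterns, round-to-nearest-even.
uint32_t soft_add(uint32_t x, uint32_t y) {
    const uint32_t kExpMask = 0x7F800000u, kManMask = 0x007FFFFFu;
    const bool x_nan = (x & kExpMask) == kExpMask && (x & kManMask) != 0;
    const bool y_nan = (y & kExpMask) == kExpMask && (y & kManMask) != 0;
    if (x_nan) return x | 0x00400000u;  // NaN + anything = (quiet) NaN
    if (y_nan) return y | 0x00400000u;
    const bool x_inf = (x & 0x7FFFFFFFu) == kExpMask;
    const bool y_inf = (y & 0x7FFFFFFFu) == kExpMask;
    if (x_inf && y_inf) return (x == y) ? x : 0x7FC00000u;  // Inf - Inf = NaN
    if (x_inf) return x;
    if (y_inf) return y;

    // Let x be the operand with the larger magnitude; its sign is the result's sign.
    if ((x & 0x7FFFFFFFu) < (y & 0x7FFFFFFFu)) { const uint32_t t = x; x = y; y = t; }
    const uint32_t sign = x & 0x80000000u;
    const bool subtract = ((x ^ y) >> 31) != 0;

    uint32_t ex = (x >> 23) & 0xFFu, ey = (y >> 23) & 0xFFu;
    // Significands with three extra low bits; subnormals have no implicit 1.
    const uint32_t sx = ((x & kManMask) | (ex ? 0x800000u : 0u)) << 3;
    uint32_t sy = ((y & kManMask) | (ey ? 0x800000u : 0u)) << 3;
    if (ex == 0) ex = 1;  // subnormals share the scale of exponent field 1
    if (ey == 0) ey = 1;
    sy = shift_right_sticky(sy, ex - ey);  // align to the larger exponent

    uint32_t sum;
    if (!subtract) {
        sum = sx + sy;
        if (sum & (1u << 27)) {  // carry out of the 24-bit significand
            sum = shift_right_sticky(sum, 1);
            ++ex;
        }
    } else {
        sum = sx - sy;
        if (sum == 0) return 0;  // exact cancellation gives +0
        while (sum < (1u << 26) && ex > 1) {  // renormalise after cancellation
            sum <<= 1;
            --ex;
        }
    }

    // Round to nearest, ties to even, using the three extra bits.
    const uint32_t grs = sum & 7u;
    sum >>= 3;
    if (grs > 4u || (grs == 4u && (sum & 1u))) ++sum;

    // The significand still carries its leading 1 at bit 23, so adding it to
    // (ex - 1) << 23 also covers a rounding carry and subnormal results.
    const uint32_t magnitude = ((ex - 1u) << 23) + sum;
    if (magnitude >= kExpMask) return sign | kExpMask;  // overflow to Inf
    return sign | magnitude;
}
```

With these assumptions, `soft_add(0xbe954bb1, 0xc2a2c2e1)` returns `0xc2a3582d`, the expected result for the test case from the question. The exact NaN bit pattern produced by hardware can differ between platforms, so a test harness should probably only check that a result is a NaN rather than compare its bits.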
Code Snippets
#include <fstream>
#include <iostream>
#include <sstream>
#include <random>
#define min_float 0x00000000
#define max_float 0xffffffff
// fully parenthesized so the macros can be used safely inside larger expressions
#define exponent(x) (((x) << 1) >> 24)
#define mantissa(x) (((x) << 9) >> 9)
#define sign(x) ((x) >> 31)
uint32_t add(uint32_t x, uint32_t y) {
uint32_t result_mantissa;
uint32_t result_exponent;
uint32_t result_sign;
uint32_t different_sign = sign(x) ^ sign(y); //boolean but lets not do any type casting
// catch NaN
if (!(exponent(x) ^ 0xFF) && mantissa(x)) return x;
if (!(exponent(y) ^ 0xFF) && mantissa(y)) return y;
// catch Inf
if (!(exponent(x) ^ 0xFF) && !(exponent(y) ^ 0xFF)) {
// both are inf
if (different_sign)
// Inf - Inf
return 0x7F800000 + 1; // NaN
else
// both Inf or -Inf
return x;
}
else if (!(exponent(x) ^ 0xFF)) return x;
else if (!(exponent(y) ^ 0xFF)) return y;
// both numbers are non-special
uint32_t exp_difference;
if (different_sign) {
exp_difference = exponent(y) + exponent(x);
}
else {
// no need to account for constant BO
// beware of underflow
if (exponent(x) > exponent(y)) exp_difference = exponent(x) - exponent(y);
else exp_difference = exponent(y) - exponent(x);
}
bool x_bigger_abs;
if (exponent(x) > exponent(y)) x_bigger_abs = true;
else if (exponent(x) < exponent(y)) x_bigger_abs = false;
else if (mantissa(x) > mantissa(y)) x_bigger_abs = true;
else x_bigger_abs = false;
if (!different_sign) {
//both numbers have same sign (this is a sum)
result_sign = sign(x);
if (x_bigger_abs) {
result_mantissa = (mantissa(x) << 1) + ((mantissa(y) << 1) >> exp_difference);
result_exponent = exponent(x);
}
else {
result_mantissa = (mantissa(y) << 1) + ((mantissa(x) << 1) >> exp_difference);
result_exponent = exponent(y);
}
if (result_mantissa << 31) result_mantissa = (result_mantissa >> 1) + 1;
else result_mantissa = (result_mantissa >> 1);
}
else {
// this actually is a subtraction
if (x_bigger_abs) {
result_sign = sign(x);
result_exponent = exponent(x);
// subtract and round to 23 bit
// this means making room in our 32bit representation
result_mantissa = (mantissa(x) << 1) - ((mantissa(y) << 1) >> exp_difference );
}
else {
result_sign = sign(y);
result_exponent = exponent(y);
// subtract and round to 23 bit
// this means making room in our 32bit representation
result_mantissa = (mantissa(y) << 1) - ((mantissa(x) << 1) >> exp_difference);
}
if (result_mantissa << 31) result_mantissa = ((result_mantissa >> 1) + 1);
else result_mantissa = (result_mantissa >> 1);
...

Total 14 PASSED 26 FAILED. (using long == 2C on my machine)
Total 38 PASSED 2 FAILED. (using uint32_t)

#define exponent(x) (((x) << 1) >> 24)
#define mantissa(x) (((x) << 9) >> 9)
#define sign(x) ((x) >> 31)

...
result = result >> (d - 1);
if (result << 31) result = (result >> 1) + 1; // round up
else result = (result >> 1); // round down

// calculate with a 24-bit mantissa and round down to 23 bit
result = (big_mantissa << 1) + ((small_mantissa << 1) >> d);
if (result << 31) result = (result >> 1) + 1; // round up
else result = (result >> 1); // round down
Context
StackExchange Code Review Q#139020, answer score: 4