patternMinor
Logic behind choosing the exponent bias as $2^7 -1$ instead of $2^7$ in $32$ bits IEEE-754 floating point representation
Problem
IEEE-754 uses $32$ bits to represent single-precision floating-point numbers. The bit fields are laid out as follows:
Biased
sign Exponent Mantissa
+-----+--------+---------+
|1 bit| 8 bits | 23 bits |
+-----+--------+---------+
Now, in general $8$ bits can represent two's complement numbers in the range $-2^7$ to $2^7-1$. The bias is added so that stored exponents are non-negative, which makes comparisons easier. To make the entire range non-negative we could add a bias of $2^7$ to the exponent, so that the range becomes $0$ to $2^8-1$. However, IEEE-754 uses an implicit leading bit for normalized mantissas, so $0$ and $\pm \infty$ need an explicit representation. The biased exponents $E'=0$ and $E'=255$ are reserved for this purpose.
So we have,
$$1 \leq E' \leq 254$$
$$\implies 1 \leq E+127 \leq 254 \quad \text{(where $E$ is the actual exponent)}$$
$$\implies -126 \leq E \leq 127$$
But had we used a bias of $128$ instead of $127$, then after reserving $E'=0$ and $E'=255$ we would have got
$$-127 \leq E \leq 126$$
So my question is: is there a technical reason for choosing $2^7-1$ instead of $2^7$, or is it just a convention followed by IEEE? A bias of $128$ would equally allow the required reservation for the special values. The only difference I see is in the range of the exponent, which would become $[-127, 126]$ instead of $[-126, 127]$.
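The two ranges derived above can be reproduced with a minimal sketch. The helper name `exponent_range` and its parameters are made up for this illustration; it only assumes, as in the question, that biased exponents $0$ and $255$ are reserved.

```python
def exponent_range(bias, exp_bits=8):
    """Return (E_min, E_max) of the actual exponent for normalized numbers.

    Assumes the all-zeros and all-ones biased exponents are reserved,
    so usable biased exponents run from 1 to 2**exp_bits - 2.
    """
    lowest_biased = 1
    highest_biased = (1 << exp_bits) - 2  # 254 for 8 exponent bits
    return lowest_biased - bias, highest_biased - bias


print(exponent_range(127))  # (-126, 127)
print(exponent_range(128))  # (-127, 126)
```

Both biases leave the same number of usable exponents (254). Only the split between negative and positive exponents changes.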
Solution
Jerome T. Coonen, "An Implementation Guide to a Proposed Standard for Floating-Point Arithmetic," Computer, Vol. 13, No. 1, January 1980, pp. 68-79
provides the following rationale:
Because of the care taken in the treatment of Underflows, the range of
normalized numbers in single, double, and quad formats has been
chosen to diminish slightly the risk of Overflow compared with the
risk of Underflow. This was done by picking the exponent bias and
alignment of the binary point in the significant digit field in such a
way that the product of the largest and smallest positive normalized
numbers is roughly 4 in each of the basic formats.
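Coonen's "roughly 4" can be checked for the single format with exact rational arithmetic. This is a sketch under the assumptions already used in the question (8 exponent bits, 23 fraction bits, biased exponents 0 and 255 reserved). The function name `extreme_product` is hypothetical, and the bias-128 case is the asker's alternative, not part of any standard.

```python
from fractions import Fraction


def extreme_product(bias, exp_bits=8, frac_bits=23):
    """Product of the smallest and largest positive normalized numbers."""
    e_min = 1 - bias
    e_max = (1 << exp_bits) - 2 - bias
    smallest = Fraction(2) ** e_min                       # 1.000...0 * 2**e_min
    largest = (2 - Fraction(1, 2**frac_bits)) * Fraction(2) ** e_max  # 1.111...1 * 2**e_max
    return smallest * largest


for bias in (127, 128):
    print(bias, float(extreme_product(bias)))
```

With a bias of 127 the product is $4 - 2^{-22}$, which is just under 4. With a bias of 128 it would be $1 - 2^{-24}$, just under 1, so the range would be tilted very slightly towards underflow protection instead.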
David Stevenson, "A Proposed Standard for Binary Floating-Point Arithmetic," Computer, Vol. 14, No. 3, March 1981, pp. 51-62
provides a somewhat different but related reasoning:
For the 32-bit format, precision was deemed the most important
criterion, hence the choice of radix 2 instead of octal or
hexadecimal. Other characteristics include not representing the
leading significand bit in normalized numbers, a minimally acceptable
exponent range which uses eight bits, and an exponent bias which
allows the reciprocal of all normalized numbers to be represented
without overflow.
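As a worked check of this criterion for single precision (the symbols $E_{\min}$, $E_{\max}$ and $x_{\max}$ are notation introduced here, not Stevenson's): the normalized number with the largest reciprocal is the smallest one, $2^{E_{\min}}$, so it suffices to compare $2^{-E_{\min}}$ with the largest finite value $x_{\max} = (2-2^{-23})\,2^{E_{\max}}$.
$$\text{bias } 127:\quad 2^{-E_{\min}} = 2^{126} < (2-2^{-23})\,2^{127} = x_{\max}$$
$$\text{bias } 128:\quad 2^{-E_{\min}} = 2^{127} > (2-2^{-23})\,2^{126} = x_{\max}$$
So with a bias of $128$ the reciprocal of the smallest normalized number would overflow, whereas with a bias of $127$ it does not.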
David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic", ACM Computing Surveys, Vol. 23, No. 1, March 1991, pp. 5-48
combines the above two explanations as follows:
Referring to Table 1, single precision has $e_{\max} = 127$ and
$e_{\min} = -126$. The reason for having $|e_{\min}| < e_{\max}$ is so that the reciprocal of the smallest number ($1/2^{e_{\min}}$) will not overflow. Although it is
true that the reciprocal of the largest number will underflow,
underflow is usually less serious than overflow.
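Both halves of Goldberg's statement can be observed on actual single-precision values. The sketch below uses only the Python standard library and rounds doubles to singles by a round trip through `struct`. The helper `to_float32` and the constant names are introduced here for illustration.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


FLT_MAX = (2 - 2.0**-23) * 2.0**127  # largest finite single
FLT_MIN = 2.0**-126                  # smallest positive normalized single

recip_of_min = to_float32(1.0 / FLT_MIN)
recip_of_max = to_float32(1.0 / FLT_MAX)

# The reciprocal of the smallest normalized number is finite (no overflow).
print(math.isfinite(recip_of_min), recip_of_min <= FLT_MAX)  # True True
# The reciprocal of the largest number falls below FLT_MIN (underflow),
# but gradual underflow still yields a nonzero subnormal result.
print(0.0 < recip_of_max < FLT_MIN)  # True
```

The underflowed result keeps a nonzero value with reduced precision, which is one way to read "underflow is usually less serious than overflow".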
Context
StackExchange Computer Science Q#131949, answer score: 3