Floating point format: why must `1−emax ≤ q+p−1 ≤ emax`?
Problem
From the Wikipedia page on the IEEE Standard for Floating-Point Arithmetic,
The possible finite values that can be represented in a format are determined by the base (b), the number of digits in the significand (precision, p), and the exponent (q) parameter emax:...
q must be an integer such that 1−emax ≤ q+p−1 ≤ emax (e.g., if p=7 and emax=96 then q is −101 through 90).
I can't figure out the reasoning behind the above inequality. I would've thought (in my simplicity) that it would be −emax ≤ q ≤ emax or something similar. What am I missing?
Solution
The reason we get a larger range is denormalized numbers. Generally speaking, a floating point number has three physical parts: a sign (1 bit), a mantissa $M$ and an exponent $e$. Most of the time we think of the number as $\operatorname{sgn} \times 1.M \times 2^{e-e_0}$, where the bias is $e_0 = 2^{|e|-1}-1$ and $|e|$ is the number of bits in the exponent field (e.g. for single precision it's 127, since the exponent is allotted eight bits). Here "$1.M$" means the number you obtain by writing $M$ as a binary string and prefixing $1.$.
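As a rough illustration of the normal-number formula, the sketch below unpacks a single-precision value into its three fields and re-evaluates $\operatorname{sgn} \times 1.M \times 2^{e-e_0}$. The helper names `decode_float32` and `normal_value` are made up for this example, and it assumes the usual binary32 layout (1 sign bit, 8 exponent bits, 23 mantissa bits).

```python
import struct

def decode_float32(x):
    """Split x (stored as single precision) into sign, exponent field e, mantissa field M."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    e = (bits >> 23) & 0xFF   # 8-bit biased exponent field
    m = bits & 0x7FFFFF       # 23-bit mantissa field
    return sign, e, m

def normal_value(sign, e, m, bias=127):
    """Evaluate sgn * 1.M * 2^(e - bias); only meaningful for normal numbers (1 <= e <= 254)."""
    return (-1.0) ** sign * (1 + m / 2**23) * 2.0 ** (e - bias)

sign, e, m = decode_float32(6.5)
print(sign, e, m, normal_value(sign, e, m))
```

Here 6.5 is $1.625 \times 2^2$, so the stored exponent field is $2 + 127 = 129$.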
For reasons having to do with underflow (non-zero numbers turning to zero), it is important to be able to store numbers very close to zero. These numbers, called denormalized or subnormal numbers, have $e = 0$ and represent $\operatorname{sgn} \times 0.M \times 2^{1-e_0}$. (This means that normal numbers cannot have $e = 0$.) This explains the extended range mentioned in the Wikipedia page.
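To tie this back to the inequality in the question, here is a short worked sketch. It assumes the convention on the Wikipedia page, where a finite value is written with an integer significand $c$ of at most $p$ digits. The symbols $s$ and $e$ are introduced here only for illustration.

```latex
\begin{align*}
x &= (-1)^s \times c \times b^{q}, \qquad 0 \le c \le b^{p}-1 \\
  &= (-1)^s \times \frac{c}{b^{p-1}} \times b^{\,q+p-1}
     \qquad \text{(so } c/b^{p-1} \text{ has the form } d_0.d_1 \ldots d_{p-1}\text{)} \\
e &= q+p-1, \qquad 1-e_{\max} \le e \le e_{\max} \\
\Rightarrow\quad & 2-p-e_{\max} \;\le\; q \;\le\; e_{\max}-p+1 \\
p=7,\ e_{\max}=96:\quad & -101 \le q \le 90
\end{align*}
```

Under this reading, the bound is really on the scientific-notation exponent $e = q+p-1$, and $q$ is shifted by $p-1$ because the significand is treated as an integer. The smallest positive value, $c = 1$ with $q = -101$, is $10^{-101} = 0.000001 \times 10^{-95}$. That lies below the smallest normal number $1.000000 \times 10^{-95}$, so it is one of the subnormal numbers described above.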
Other values with special encodings are $\pm \infty$ and NaN (not a number), which represent the result of some illegal operation (division by zero, taking the logarithm of a non-positive number, taking the square root of a negative number, and so on). For more details, consult the Wikipedia page regarding the original standard.
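The different encodings can be told apart by the exponent field alone. The following is a minimal sketch for single precision, assuming the standard binary32 layout; `classify_float32` and `subnormal_value` are hypothetical helper names, not library functions.

```python
import math
import struct

def classify_float32(x):
    """Classify x (stored as single precision) by its 8-bit exponent field."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    e = (bits >> 23) & 0xFF   # biased exponent field
    m = bits & 0x7FFFFF       # 23-bit mantissa field
    if e == 0:
        return "zero" if m == 0 else "subnormal"
    if e == 0xFF:
        return "infinity" if m == 0 else "nan"
    return "normal"

def subnormal_value(m, bias=127):
    """Evaluate 0.M * 2^(1 - bias) for a positive subnormal with mantissa field m."""
    return (m / 2**23) * 2.0 ** (1 - bias)

for x in (1.0, 2.0 ** -149, 0.0, math.inf, math.nan):
    print(x, classify_float32(x))
```

In this sketch the smallest positive subnormal has mantissa field 1, and `subnormal_value(1)` evaluates to $2^{-149}$, well below the smallest normal number $2^{-126}$.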
Context
StackExchange Computer Science Q#21930, answer score: 3