debugMinor

Proof that a guard digit bound the error of subtraction

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

boundtheerrorsubtractionthatguarddigitproof

Problem

I was reading What Every Computer Scientist Should Know About Floating-Point Arithmetic, which is extremely interesting. But I have some troubles understanding the proof of Theorem 9 (page 33).

First a pretty trivial question. When the formula $(15)$ say:
$$y - \bar{y} \lt (\beta - 1)(\beta^{-p} + \dots + \beta^{-p-k})$$
Shouldn't it be $\le$ instead of $\lt$, or did I miss something?

More importantly, I do not understand why it say that if $x-\bar{y} \lt 1$, then $\delta = 0$. How can there be no rounding error?

It is then said that:
$$x - y \ge 1.0 - 0.\overbrace{0 \dots 0}^k\overbrace{\rho \dots \rho}^k \; \textrm{with} \; \rho = \beta - 1$$

Why is that so? Can't the difference be arbitrarily small or even $0$? Why would there be as many $\rho$'s as there are $0$'s?

Solution

I'll post some elements of answer to my own question from what I understand. Anyone feel free to make a better, less sloppy answer.

First, there are several versions of this document online. They all have some typographic defects at some point. So when in doubt, better check those 3 versions first.

https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf

Second, there's a lot of implicit stuff going on with this first actual proof. So let's clarify some of them.

Reachable bound of $y - \bar{y}$

I think that, no, the bound on $y - \bar{y}$ cannot be reached. Let's try to go for the worst case with $k = p+1$. Let's assume $p = 6$ and the radix $\beta = 10$.
\begin{align}
y & = \overbrace{0.000000}^{k=7}\overbrace{999999}^{p=6} \\
\bar{y} & = \overbrace{0.000000}^{p+1=7} \\
y - \bar{y} & = 0.000000999999 \\
& = 9 \times (10^{-7}+\dots+10^{-12}) \\
& = (\beta - 1)(\beta^{-p-1}+\dots+\beta^{-p-p}) \\
& \lt (\beta - 1)(\beta^{-p-1}+\dots+\beta^{-p-p}+\beta^{-p-(p+1)}) \\
& \lt (\beta - 1)(\beta^{-p-1}+\dots+\beta^{-p-p}+\beta^{-p-k})
\end{align}

I had to add an extra term to the sum (i.e. an extra digit) to get the same formula as the paper. Hence the $=$ got transformed to $\lt$.

This exact bound simplify the second case of the proof. Although a looser and simpler bound $y - \bar{y} \lt \beta^{-p}$ would be enough for the first case when $x - y \ge 1$.

No rouding error when $x - \bar{y} \lt 1$

The assertion that if $x - \bar{y} \lt 1$, then $\delta = 0$ seems true. The rounding error $\delta$ is the error we introduce by removing the guard digit to produce the final result. This is sometimes needed because $\bar{y}$ has $p+1$ digits and $x - \bar{y}$ might as well have $p+1$ digits.

For instance:
\begin{align}
x & = 2.00000 \\
y & = 0.0123456 \\
\bar{y} & = 0.012345 \\
x - \bar{y} & = 1.987655
\end{align}
The result has $7$ digits therefore has to be rounded to $6$ digits.

But when $x - \bar{y} \lt 1$ then the result starts with the digit $0$. Meaning that at most only $p$ significant digits remain from the $p+1$ digits used to perform the subtraction. For example:
\begin{align}
x & = 1.00000 \\
y & = 0.0123456 \\
\bar{y} & = 0.012345 \\
x - \bar{y} & = 0.987655
\end{align}
The result $x - \bar{y}$ already has 6 digits. So no rounding error need to be introduced.

However, it is worth noting that this doesn't change anything to the error introduced by truncating $y$ to $\bar{y}$.

Minimum value of $x - y$

When the paper say that the smallest value $x - y$ can take is:
$$1.0 - 0.\overbrace{0 \dots 0}^k\overbrace{\rho \dots \rho}^k$$
there's actually a typo that has been corrected in other documents. What was ment to be written is:
$$1.0 - 0.\overbrace{0 \dots 0}^k\overbrace{\rho \dots \rho}^p$$
which make more sens and is the general formula when $k$ is fixed. Maybe a better form could be:
\begin{align}
x - y & \ge 1.0 - 0.\overbrace{0 \dots 0}^k\overbrace{\rho \dots \rho}^p \\
& \ge 0.\overbrace{\rho \dots \rho}^k\overbrace{0 \dots 0}^{p-1}1 \\
& \gt 0.\overbrace{\rho \dots \rho}^k
\end{align}

Which lead naturally to the formula written in the paper.

Third case: $x - \bar{y} = 1$

And finally, another implicit point of this proof (not asked but worth writing down): why would having both $x - y \lt 1$ and $x - \bar{y} \ge 1$ imply that $x - \bar{y} = 1$?

I think the informal answer would be that since $x$ is a float with $p$ digits, then $x - 1$ is also a float with $p$ or $p - 1$ significant digits (if $x >= 2$ or $x x - 1$ and $x - 1$ is a candidate value for $\bar{y}$ and is less than $y$.

In other words: $x - y \lt 1$ imply that $x - \bar{y} \le 1$. Therefore, having this and $x - \bar{y} \ge 1$ at the same time imply that $x - \bar{y} = 1$.

Context

StackExchange Computer Science Q#106152, answer score: 2

Revisions (0)

No revisions yet.