debugMinor

Union of fixed and floating point types

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

numerical-algorithms dependent-types cs stackoverflow floating-point typing

pointfloatingunionandtypesfixed

Problem

Say I have two real number types. They may be floating or fixed point. How can I construct a new type whose values are at least the union of the two with the minimal number of bits?

There are 3 cases to consider:

Fixed (Qa.x) $\cup$ Fixed (Qb.y) - I think the best here is to use Qmax(a,b).max(x,y). I think this is optimal since I can't come up with anything smaller that will accurately represent the type.

Float (FaEx) $\cup$ Float (FbEy) - I think the best here is to use Fmax(a,b)Emax(x,y). Again I can't think of a more optimal solution.

I am using Q notation for representing fixed point types. I don't know how floating point types are typically represented; I'm using an analogous representation where FaEx means a bits of mantissa and x bits of exponent.

The difficult case is:

Fixed (Qa.x) $\cup$ Float (FbEy) - The best I can come up with is Qmax(a,n).max(x,m) where n is the minimal bits to represent the biggest number the float can be and m is the minimal number of bits to represent the smallest positive fraction the float can be. This seems extremely inefficient as it extends the floating point's most accurate precision to its entire range. Thus for any decent sized floating point type the resulting union type will be extremely large.

Here are some ASCII diagrams of the three cases (simplified), and why I think I'm wrong:

```
0/4 1/4 2/4 3/4 4/4 5/4 6/4 7/4 8/4 9/4 10/4 11/4 12/4 13/4 14/4 15/4 16/4
| . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . | . v . |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q1.2 |.......|.......|.......|.......|.......|.......|.......|........................................................................
U ..............................................................................................

Solution

I actually think the simplest scheme would just to prepend a signal bit that says that the rest of the bits are either a floating point or fixed point value.

This always comes with a fixed cost of 1 bit. In your notation, $\max(a + x, b + y) + 1$.

I believe the only way you can do better (you can't save fractional bits without going into entropy coding in a larger context, which is a whole other story) is if one of the types fully contains the other, in which case you can simply ignore the other type.

Context

StackExchange Computer Science Q#90304, answer score: 2

Revisions (0)

No revisions yet.