
Half precision reader/writer for C#

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


Problem

I'm reading/writing half precision floating point numbers in C#. These are basically 16 bit floats, compared to the usual 32/64 bit floats and doubles we are used to working with.

I've taken some highly tested Java code from an obvious "expert on the subject" here and modified it to work with C#. Is this correct?

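```
// ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            // 10 bits mantissa
    int exp = hbits & 0x7c00;             // 5 bits exponent
    if( exp == 0x7c00 )                   // NaN/Inf
        exp = 0x3fc00;                    // -> NaN/Inf
    else if( exp != 0 )                   // normalized value
    {
        exp += 0x1c000;                   // exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  // smooth transition
            return BitConverter.ToSingle(BitConverter.GetBytes( ( hbits & 0x8000 ) << 16
                                            | exp << 13 | 0x3ff ), 0);
    }
    else if( mant != 0 )                  // && exp==0 -> subnormal
    {
        exp = 0x1c400;                    // make it normal
        do {
            mant <<= 1;                   // mantissa * 2
            exp -= 0x400;                 // decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); // while not normal
        mant &= 0x3ff;                    // discard subnormal bit
    }                                     // else +/-0 -> +/-0
    return BitConverter.ToSingle(BitConverter.GetBytes( // combine all parts
        ( hbits & 0x8000 ) << 16          // sign  << ( 31 - 15 )
        | ( exp | mant ) << 13 ), 0);     // value << ( 23 - 10 )
}

// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = BitConverter.ToInt32(BitConverter.GetBytes( fval ), 0);
    int sign = fbits >>> 16 & 0x8000;          // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; // rounded value

    if( val >= 0x47800000 )               // might be or become NaN/Inf
    {                                     // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                 // is or must become NaN/Inf
            if( val < 0x7f800000 )        // was value but too large
                return sign | 0x7c00;     // make it +/-Inf
            return sign | 0x7c00 |        // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;             // unrounded not quite Inf
    }
    if( val >= 0x38800000 )               // remains normalized value
        return sign | val - 0x38000000 >>> 13; // exp - 127 + 15
    if( val < 0x33000000 )                // too small for subnormal
        return sign;                      // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;  // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
         + ( 0x800000 >>> val - 102 )     // round depending on cut off
      >>> 126 - val );   // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}
```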
Solution

Is this correct?

Maybe.

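For a start, the >>> operator didn't exist in C# when this was written (an unsigned right shift operator was only added in C# 11), so the code as posted doesn't compile on older compilers.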
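On a compiler without an unsigned right shift, Java's `x >>> n` can be emulated by shifting the value as a `uint`. The following is only a minimal sketch: the helper name `LogicalShiftRight` is hypothetical (it is not part of the posted code), and the snippet assumes top-level statements (C# 9 or later).

```csharp
using System;

// Hypothetical helper: emulates Java's x >>> n for a 32-bit int.
static int LogicalShiftRight(int x, int n) => unchecked((int)((uint)x >> n));

int negative = unchecked((int)0x80000000); // only the sign bit is set

// Arithmetic shift copies the sign bit into the vacated positions.
Console.WriteLine((negative >> 16).ToString("x8"));                // ffff8000
// Logical shift fills the vacated positions with zeros.
Console.WriteLine(LogicalShiftRight(negative, 16).ToString("x8")); // 00008000
```

In `fromFloat` the sign extraction is masked with `0x8000` after the shift, and the other shifted values appear to be non-negative, so plain `>>` should give the same results there; the difference only matters when the shifted value can be negative and is not masked afterwards.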

Taking a guess, I replaced >>> with >> and then wrote the following 'unit test' for it:

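    static void assertFloat(float fval)
    {
        int i = fromFloat(fval);
        float f2 = toFloat(i);
        // Compare the round-tripped float (not the encoded bits); NaN never equals itself.
        bool same = float.IsNaN(fval) ? float.IsNaN(f2) : fval == f2;
        if (!same)
            throw new ApplicationException();
    }

    static void Main(string[] args)
    {
        assertFloat(0);
        assertFloat(1);
        assertFloat(0.5f);
        assertFloat(-0.5f);
        assertFloat(-0); // note: -0 is the int literal 0, so this is +0.0f
        assertFloat(float.PositiveInfinity);
        assertFloat(float.NaN);
        float big = 1024 * 1024;
        big *= big;
        assertFloat(big);
    }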

It failed on the second test, because 1.0 round-trips to 1.000122.

I am disappointed by an encoding scheme which cannot encode '1.0' exactly.

However you didn't say how exact you expect the round-trip to be, so I don't know whether it's correct.
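One way to state an expectation is a relative-error bound. A correctly rounded binary16 conversion keeps 11 significant bits, so for values in the normal half range the round-trip error should be at most 2^-11 relative to the input. The sketch below uses `System.Half` as a reference converter, which assumes .NET 5 or later; the helper name `RelativeError` is hypothetical.

```csharp
using System;

// Relative round-trip error through System.Half (assumes .NET 5 or later).
static double RelativeError(float value)
{
    float roundTripped = (float)(Half)value;
    return Math.Abs((double)roundTripped - value) / Math.Abs((double)value);
}

double bound = Math.Pow(2, -11); // half a unit in the last place, relative to the leading bit

foreach (float x in new[] { 1.0f, 1.1f, 5.6f, 0.001f, 1024.0f })
    Console.WriteLine($"{x}: within bound = {RelativeError(x) <= bound}");
```

The same bound could be applied to `toFloat(fromFloat(x))` to decide whether a given deviation is acceptable rounding or a defect.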

The following is a list of some input with corresponding output:

0 -> 0
1 -> 1.000122
1.1 -> 1.099609
-1 -> -1.000122
0.5 -> 0.500061
-0.5 -> -0.500061
0.001 -> 0.001000404
5.5 -> 5.5
5.6 -> 5.601563
5.7 -> 5.699219
0 -> 0
Infinity -> Infinity
NaN -> NaN
1024 -> 1024.125
1048576 -> Infinity

So it seems approximately correct for those numbers.

I can't say whether it's the best encoding. For example, this encoding gains the ability to express decimals but loses the ability to express integers (they become approximated) and big integers (they become infinity).

A different encoding scheme could be devised (and might be more useful depending on your application) which cannot express decimals but which gains the ability to express (approximately) some numbers which are bigger-than-the-biggest integer.

There are a lot of 'magic numbers' (i.e. hard-coded constants) in the code. To inspect for correctness I would need to guess/reverse engineer the way in which you encode/use the 16 bits for your "half precision float" numbers. You could make it easier by documenting the format using comments: which bits do you use for what?
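As an illustration of the kind of documentation meant here, the IEEE 754 binary16 layout can be written down with named constants. This is a sketch under the assumption that the code targets that standard layout; the constant names and the `Describe` helper are hypothetical.

```csharp
using System;

// IEEE 754 binary16: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
const int SignMask     = 0x8000; // bit 15
const int ExponentMask = 0x7c00; // bits 14..10
const int MantissaMask = 0x03ff; // bits 9..0
const int ExponentBias = 15;

// Hypothetical helper: splits a half stored in the low 16 bits into its fields.
// Exponent 0 (zero/subnormal) and 31 (Inf/NaN) are special cases, so the
// "unbiased" value printed here is only meaningful for exponents 1..30.
string Describe(int hbits)
{
    int sign = (hbits & SignMask) >> 15;
    int exponent = (hbits & ExponentMask) >> 10;
    int mantissa = hbits & MantissaMask;
    return $"sign={sign} exponent={exponent} (unbiased {exponent - ExponentBias}) mantissa=0x{mantissa:x3}";
}

Console.WriteLine(Describe(0x3c00)); // sign=0 exponent=15 (unbiased 0) mantissa=0x000
```

With names like these, constants such as `0x7c00` and `0x03ff` in the posted code become self-explanatory.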

Note this comment in the post you linked to:

I see what you mean but these NaN values won't be returned from Float.floatToIntBits which normalizes all NaNs to 0x7fc00000. The rounded val can thus never become negative. Maybe it would be faster to use floatToRawIntBits (which does not do NaN normalization) and then deal with the overflow NaNs i.e. by adding || val < 0 to the first branch. This would also allow to preserve some of the extra NaN bits. I remember that I had planned to do this but couldn't find sufficient documentation on how to handle these bits and thus settled with normalized NaNs.

The BitConverter which you use may not (I haven't tested it) normalize NaN values in the same way.
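A quick way to check this on a given runtime is to push a NaN with a non-zero payload through the conversion and print the bits that come back. The sketch below assumes `BitConverter.Int32BitsToSingle` and `BitConverter.SingleToInt32Bits` are available (.NET Core 2.0 or later); it deliberately makes no claim about the payload, since that is the thing being tested.

```csharp
using System;

int quietNaNWithPayload = 0x7fc00001; // quiet NaN with one extra payload bit set
float f = BitConverter.Int32BitsToSingle(quietNaNWithPayload);
int back = BitConverter.SingleToInt32Bits(f);

Console.WriteLine(float.IsNaN(f));      // True
Console.WriteLine(back.ToString("x8")); // shows whether the payload survived on this runtime
```

If the printed bits differ from `7fc00001`, NaNs are being normalized somewhere along the way; if they match, `fromFloat` may see NaN bit patterns that the Java original never had to handle.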

The OP linked to a spec for the half-precision floating-point format which says,

Integers between 0 and 2048 can be exactly represented

So my testing, which shows an error when round-tripping 1, suggests that this spec is NOT implemented correctly.
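The spec's claim can be checked against a reference implementation. This sketch uses `System.Half` (an assumption: it requires .NET 5 or later) and counts the integers in that range that do not survive a round trip.

```csharp
using System;

int mismatches = 0;
for (int i = 0; i <= 2048; i++)
{
    // float -> Half -> float, using the framework's own conversions
    float roundTripped = (float)(Half)(float)i;
    if (roundTripped != i)
        mismatches++;
}

Console.WriteLine(mismatches); // 0
```

Swapping the `System.Half` conversion for `toFloat(fromFloat(i))` in the loop would show which integers the posted code fails to reproduce.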

When I test it, 1.0f is encoded as 0x3c00 which is correct according to the spec. So the bug is presumably being introduced in the toFloat method, specifically this statement:

return BitConverter.ToSingle(BitConverter.GetBytes( ( hbits & 0x8000 ) << 16
                                        | exp << 13 | 0x3ff ), 0);

This is a line which the "obvious expert on the subject" said they "implemented [as a] small extension compared to the book".

In other words, you may have translated the Java faithfully, but the Java doesn't correctly/fully implement the spec (it tries to improve on the spec, perhaps resulting in an inability to accurately decode small integers).
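To see where the extra 0.000122 comes from: for 0x3c00 the exponent field becomes 0x3c00 + 0x1c000 = 0x1fc00, and 0x1fc00 << 13 is 0x3f800000, which is exactly 1.0f. OR-ing in 0x3ff gives 0x3f8003ff, i.e. 1 + 1023/2^23. A decoder without that step might look like the sketch below; it is the posted `toFloat` with the early return removed, not a tested replacement, and `HalfBitsToSingle` is a hypothetical name. It assumes `BitConverter.Int32BitsToSingle` (.NET Core 2.0 or later).

```csharp
using System;

// Sketch: decodes a binary16 value from the low 16 bits, without the "smooth transition" step.
static float HalfBitsToSingle(int hbits)
{
    int mant = hbits & 0x03ff;   // 10 mantissa bits
    int exp = hbits & 0x7c00;    // 5 exponent bits, still in place
    if (exp == 0x7c00)           // NaN/Inf
        exp = 0x3fc00;           // all-ones float exponent
    else if (exp != 0)           // normalized value
        exp += 0x1c000;          // rebias: exp - 15 + 127
    else if (mant != 0)          // subnormal: shift until the implicit bit appears
    {
        exp = 0x1c400;
        do { mant <<= 1; exp -= 0x400; } while ((mant & 0x400) == 0);
        mant &= 0x3ff;           // drop the now-implicit leading bit
    }
    int fbits = (hbits & 0x8000) << 16 | (exp | mant) << 13;
    return BitConverter.Int32BitsToSingle(fbits);
}

Console.WriteLine(HalfBitsToSingle(0x3c00));                   // 1
Console.WriteLine(BitConverter.Int32BitsToSingle(0x3f8003ff)); // the value the extension produces instead
```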

Context

StackExchange Code Review Q#45007, answer score: 3
