IEEE 754 floating-point representation

Number representation

| Floating-point system | Sign bits | Exponent bits | Fraction (mantissa) bits |
|---|---|---|---|
| 32-bit | 1 | 8 | 23 |
| 64-bit | 1 | 11 | 52 |

The following are the representations of the value 1.0 in each floating-point system (s stands for sign, e for exponent, f for fraction).

32-bit:

seeeeeeeefffffffffffffffffffffff
00111111100000000000000000000000

64-bit:

seeeeeeeeeeeffffffffffffffffffffffffffffffffffffffffffffffffffff
0011111111110000000000000000000000000000000000000000000000000000
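The two bit patterns above can be checked with a short script. This is a minimal sketch using Python's standard `struct` module; the helper names `bits32`, `bits64`, and `split_fields` are made up for illustration and are not part of any library.

```python
import struct

def bits32(value):
    # Pack as a big-endian 32-bit float, then reinterpret the same 4 bytes
    # as an unsigned 32-bit integer so the raw bits can be printed.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    return format(raw, "032b")

def bits64(value):
    # Same idea for the 64-bit format: 8 bytes, unsigned 64-bit integer.
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    return format(raw, "064b")

def split_fields(bits, exponent_bits):
    # Hypothetical helper: split a bit string into (sign, exponent, fraction).
    return bits[0], bits[1:1 + exponent_bits], bits[1 + exponent_bits:]

print(split_fields(bits32(1.0), 8))
print(split_fields(bits64(1.0), 11))
```

For 1.0 the sign bit is 0, the exponent field is a 0 followed by all ones, and the fraction field is all zeros, in both formats.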

Calculating floating point value

As you might imagine, a floating-point value is calculated roughly as

$$\text{value} = (-1)^{\text{sign}} \times \text{fraction} \times 2^{\text{exponent}}$$

where the sign bit is 0 for a positive number and 1 for a negative number.

But there are a few things to notice:

First: To represent small numbers such as 0.0000123 in a floating-point system, the exponent must be allowed to be negative. The exponent part therefore has a bias, which equals $2^{k-1} - 1$, where $k$ is the number of exponent bits in the floating-point system. The stored exponent equals $\text{actual exponent} + \text{bias}$. The bias is 127 in the 32-bit system and 1023 in the 64-bit system.

Therefore, an actual exponent of 0 in the 32-bit system is stored as:

01111111
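The bias rule can be sketched in a few lines of Python. The function names `exponent_bias` and `encode_exponent` are hypothetical, chosen only for this example.

```python
def exponent_bias(exponent_bits):
    # bias = 2^(k-1) - 1, where k is the number of exponent bits
    return 2 ** (exponent_bits - 1) - 1

def encode_exponent(actual_exponent, exponent_bits):
    # stored exponent = actual exponent + bias
    stored = actual_exponent + exponent_bias(exponent_bits)
    # The all-zero and all-one exponent fields are reserved for special
    # values, so an ordinary number must fall strictly between them.
    if not 0 < stored < 2 ** exponent_bits - 1:
        raise ValueError("exponent out of range for an ordinary number")
    return format(stored, "0{}b".format(exponent_bits))

print(exponent_bias(8), exponent_bias(11))
print(encode_exponent(0, 8))
print(encode_exponent(-5, 8))
```

An actual exponent of 0 maps to the stored value 127, which is the bit string shown above; negative exponents map to stored values below 127.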

Second: Base two makes a nice little optimization available, since binary has only one possible non-zero digit: 1. We can therefore assume a leading digit of 1 without storing it, so that a 32-bit floating-point value effectively has 24 bits of mantissa: 23 explicit fraction bits plus one implicit leading bit of 1. [1]

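As a result, the real value of the mantissa equals $1 + f / 2^{n}$, where $f$ is the fraction part read as an unsigned integer and $n$ is the number of fraction bits, because there is an "invisible" 1 in front of the stored fraction bits.

Therefore, a mantissa of 1 in the 32-bit system is stored as an all-zero fraction part: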

00000000000000000000000
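As a small illustration of the implicit leading 1, the sketch below turns a stored fraction bit string into the mantissa value it stands for. `mantissa_value` is a hypothetical helper name, and the input is assumed to be a string of 0s and 1s.

```python
def mantissa_value(fraction_field):
    # The stored bits are the digits after the binary point; the leading 1
    # is implied and never stored.
    n = len(fraction_field)
    return 1 + int(fraction_field, 2) / 2 ** n

print(mantissa_value("0" * 23))         # all-zero fraction: mantissa is 1.0
print(mantissa_value("1" + "0" * 22))   # binary 1.1  = 1.5
print(mantissa_value("01" + "0" * 21))  # binary 1.01 = 1.25
```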

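To conclude, the real value of a floating-point number with sign bit $s$, stored exponent $e$, and fraction part $f$ (with $e$ and $f$ read as unsigned integers) is:

1. for a 32-bit float: $(-1)^{s} \times \left(1 + \frac{f}{2^{23}}\right) \times 2^{e - 127}$
2. for a 64-bit float: $(-1)^{s} \times \left(1 + \frac{f}{2^{52}}\right) \times 2^{e - 1023}$

Exceptions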

Zero

Condition: When exponent part = 0 and fraction part = 0.

Because the mantissa always assumes an implicit leading 1, the normal formula can never produce exactly zero. Zero is therefore defined by convention as the number whose exponent part and fraction part are both 0.

Additionally, because of the sign bit, there are two zeros in floating point: +0 and -0, which have different bit-level representations.
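The two zeros can be inspected from Python. This is a minimal sketch using the standard `struct` and `math` modules; `bits32` is a hypothetical helper that returns the raw 32-bit pattern as a string.

```python
import math
import struct

def bits32(value):
    # Reinterpret the 4 bytes of a 32-bit float as an unsigned integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    return format(raw, "032b")

pos_zero, neg_zero = 0.0, -0.0

print(bits32(pos_zero))  # every bit is 0
print(bits32(neg_zero))  # only the sign bit is 1
print(pos_zero == neg_zero)            # the two zeros compare as equal
print(math.copysign(1.0, neg_zero))    # the sign is still observable
```

Although the bit patterns differ, the two zeros compare equal; `math.copysign` is one way to tell them apart.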

Infinity

Condition: When the exponent part is all ones (255 in the 32-bit system, 2047 in the 64-bit system) and the fraction part = 0.

Not A Number

Condition: When the exponent part is all ones (255 in the 32-bit system, 2047 in the 64-bit system) and the fraction part is not 0.
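The conditions for zero, infinity, and NaN can be combined with the ordinary formula into one decoder for 32-bit patterns. This is only a sketch: `decode_float32` is a hypothetical name, and the branch for an all-zero exponent with a non-zero fraction (subnormal numbers) is an addition that this article does not cover.

```python
import math

def decode_float32(bits):
    """Decode a string of 32 binary digits into a Python float."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    sign = -1.0 if bits[0] == "1" else 1.0
    exponent = int(bits[1:9], 2)   # stored (biased) exponent
    fraction = int(bits[9:], 2)    # 23 stored fraction bits

    if exponent == 255:
        # All-ones exponent: infinity if the fraction is 0, otherwise NaN.
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:
        # fraction == 0 gives +0 or -0. A non-zero fraction is a subnormal
        # number (no implicit leading 1), which the text does not discuss.
        return sign * (fraction / 2 ** 23) * 2.0 ** (1 - 127)
    # Ordinary number: implicit leading 1 and a bias of 127.
    return sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)

print(decode_float32("00111111100000000000000000000000"))
print(decode_float32("0" + "1" * 8 + "0" * 23))
print(decode_float32("0" + "1" * 8 + "1" + "0" * 22))
```

The first call decodes the 32-bit pattern shown near the top of the article; the other two use an all-ones exponent with a zero and a non-zero fraction.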

Special Operations

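| Operation | Result |
|---|---|
| $n \div \pm\infty$ | 0 |
| $\pm\infty \times \pm\infty$ | $\pm\infty$ |
| $\pm\text{nonzero} \div 0$ | $\pm\infty$ |
| $\infty + \infty$ | $\infty$ |
| $\pm 0 \div \pm 0$ | NaN |
| $\infty - \infty$ | NaN |
| $\pm\infty \div \pm\infty$ | NaN |
| $\pm\infty \times 0$ | NaN |
| any arithmetic operation on NaN | NaN |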
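Some special-value arithmetic can be tried directly in Python, whose `float` is a 64-bit IEEE 754 value on common platforms. This is a sketch rather than a complete check: Python raises `ZeroDivisionError` for float division by zero instead of returning infinity or NaN, so those cases are left out. The dictionary name `results` is just for illustration.

```python
import math

inf = math.inf

results = {
    "1 / inf": 1.0 / inf,      # a finite number divided by infinity
    "inf * inf": inf * inf,
    "inf + inf": inf + inf,
    "inf - inf": inf - inf,    # no meaningful answer, so NaN
    "inf / inf": inf / inf,
    "inf * 0": inf * 0.0,
    "nan + 1": math.nan + 1.0, # NaN propagates through arithmetic
}

for expression, value in results.items():
    print(expression, "->", value)

# NaN compares unequal to everything, including itself.
print(math.nan == math.nan)
```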

1. https://steve.hollasch.net/cgindex/coding/ieeefloat.html