floating point

A floating-point number consists of three parts:

  1. sign bit (s)
  2. exponent bits (e), which encode the exponent E
  3. fraction bits (f), which encode the significand M
\begin{equation} V = (-1)^s \times M \times 2^E \end{equation}

Depending on the exponent field e, a bit pattern falls into one of three cases:

  • normalized
  • denormalized
  • special values (infinity and NaN)

normalized

The exponent field e is neither all 0s nor all 1s.

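$E = e - Bias$
$Bias = 2^{k-1} - 1$, where k is the number of exponent bits
$M = 1.f_{n-1}...f_1f_0$, where n is the number of fraction bits (the leading 1 is implied, not stored)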
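
As a worked example, assume an 8-bit format with k = 4 exponent bits and n = 3 fraction bits (the sizes are an assumption here). The bit pattern 0 0110 110 has s = 0, e = 6 and f = 110, so:

$Bias = 2^{4-1} - 1 = 7$
$E = e - Bias = 6 - 7 = -1$
$M = 1.110_2 = 1 + 6/8 = 14/8$
$V = (-1)^0 \times 14/8 \times 2^{-1} = 14/16$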

denormalized

The exponent field e is all 0s.

$E = 1 - Bias$
$M = 0.f_{n-1}...f_1f_0$ (no implied leading 1)
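
As a worked example, assume the same 8-bit format with k = 4, n = 3 and Bias = 7. The bit pattern 0 0000 011 has s = 0, e = 0 and f = 011, so:

$E = 1 - Bias = 1 - 7 = -6$
$M = 0.011_2 = 3/8$
$V = (-1)^0 \times 3/8 \times 2^{-6} = 3/512$

Using $1 - Bias$ rather than $0 - Bias$ gives this case the same exponent as the smallest normalized values. In this format the largest value with e all 0s is 7/512 and the smallest value with e = 1 is 8/512, so there is no gap between the two ranges.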

special values

The exponent field e is all 1s.

When f = 0, the value is $+\infty$ or $-\infty$, depending on the sign bit s.
Otherwise it is NaN (Not a Number).
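
The cases above can be sketched as a small decoder. This is a minimal illustration, not a reference implementation: the function name `decode` and its defaults (k = 4, n = 3) are assumptions chosen for this page, and finite values are returned as exact fractions so they can be compared with hand calculations.

```python
from fractions import Fraction
import math


def decode(bits: int, k: int = 4, n: int = 3):
    """Decode a (1 + k + n)-bit pattern into the value it represents.

    Finite values come back as an exact Fraction; infinities and NaN
    come back as Python floats. Negative zero is returned as plain 0,
    since Fraction has no signed zero.
    """
    s = (bits >> (k + n)) & 1          # sign bit
    e = (bits >> n) & ((1 << k) - 1)   # exponent field
    f = bits & ((1 << n) - 1)          # fraction field
    bias = (1 << (k - 1)) - 1          # Bias = 2^(k-1) - 1
    sign = -1 if s else 1

    if e == (1 << k) - 1:              # exponent all 1s: special values
        return sign * math.inf if f == 0 else math.nan

    if e == 0:                         # exponent all 0s: denormalized
        E = 1 - bias
        M = Fraction(f, 1 << n)        # no implied leading 1
    else:                              # normalized
        E = e - bias
        M = 1 + Fraction(f, 1 << n)    # implied leading 1

    return sign * M * Fraction(2) ** E


print(decode(0b0_0000_001))  # 1/512
print(decode(0b0_0111_000))  # 1
print(decode(0b0_1110_111))  # 240
```

Passing other values of k and n decodes other field widths with the same rules, for example `decode(bits, k=8, n=23)` for a 32-bit pattern.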

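The following example shows an 8-bit floating-point format with k = 4 exponent bits, n = 3 fraction bits and Bias = 7. For non-negative values, V grows together with the bit representation read as an unsigned integer, so two numbers can be compared by comparing their bit patterns.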

Description              Bit representation   e    E    f     M      V
Zero                     0 0000 000           0    -6   0     0      0
Smallest denormalized    0 0000 001           0    -6   1/8   1/8    1/512
                         0 0000 010           0    -6   2/8   2/8    2/512
                         0 0000 011           0    -6   3/8   3/8    3/512
                         …
                         0 0000 110           0    -6   6/8   6/8    6/512
Largest denormalized     0 0000 111           0    -6   7/8   7/8    7/512
Smallest normalized      0 0001 000           1    -6   0     8/8    8/512
                         0 0001 001           1    -6   1/8   9/8    9/512
                         …
                         0 0110 110           6    -1   6/8   14/8   14/16
                         0 0110 111           6    -1   7/8   15/8   15/16
One                      0 0111 000           7    0    0     8/8    1
                         0 0111 001           7    0    1/8   9/8    9/8
                         0 0111 010           7    0    2/8   10/8   10/8
                         …
                         0 1110 110           14   7    6/8   14/8   224
Largest normalized       0 1110 111           14   7    7/8   15/8   240
Infinity                 0 1111 000           -    -    -     -      $\infty$
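
The ordering visible in the table is not specific to the 8-bit format. As a hedged illustration, the sketch below checks the same property on IEEE 754 single precision (k = 8, n = 23) using Python's standard `struct` module. The helper name `float_bits` and the sample values are assumptions made for this example, and the check covers non-negative values only.

```python
import struct


def float_bits(x: float) -> int:
    """Return the single-precision bit pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


# Non-negative samples in increasing order: zero, two denormalized values,
# several normalized values, and infinity.
values = [0.0, 1e-45, 1e-38, 0.5, 1.0, 1.5, 240.0, 3e38, float("inf")]
patterns = [float_bits(v) for v in values]

# The bit patterns, read as unsigned integers, are in the same order.
print(patterns == sorted(patterns))  # True
```

For negative values the order of the bit patterns is reversed, so a comparison based on bit patterns has to handle the sign bit separately.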